Querying one record from tens of millions of records in Azure Table Storage - azure

I have a typical scenario where a consumer synchronously calls an Azure Function (EP1), which then queries Azure Table Storage (holding 5 million records) based on the input parameters of the Azure Function API.
The table has the following columns:
Order Number (incremental number)
IsConfirmed (can have value Y or N)
Type of Order (can be of 6 types maximum)
Order Date
Order Details
UUID
When the consumer queries, it generally searches by Order Number and expects the Order Date and Order Details in the response, along with the Order Number.
For this, we had chosen:
Partition Key: IsConfirmed + Type of Order
Row Key: UUID
With 5 million records, this partition key means a single partition often holds more than 3 million records (most orders have IsConfirmed as Y and one specific Type of Order among the six), and the table query takes more than 5 minutes.
As a result, the consumer generally times out, as the wait configured on the consumer side is 60 seconds.
So I am looking for recommendations on how to do this efficiently.
Can we choose Order Number as the partition key (but that will create 5 million partitions), or a combination of Order Number + IsConfirmed + Type of Order?
Ours is a write-heavy Java application; reads happen much less often.
+++++++++++ UPDATE +++++++++++++++
As suggested by Gaurav in the answer, after making the order ID the partition key, the query is working as expected.
That brings us to the next problem: we also have other API queries where only the order date and type are used as input search criteria.
Since these do not match the partition key, this second type of query basically does a full table scan, and the consumer times out again.
So what should the design be to handle these types of queries? The Azure docs suggest creating a separate table where order type + order date is the partition key. However, that means that whenever we write to the table, we have to write to both tables (one with the order ID as the partition key and the other with order date + type as the partition key).

Can we choose partition key as Order Number (but that will create 5 million partitions) or a combination of Order Number+IsConfirmed+TypeofOrder?
You can certainly choose the order number as the partition key, as there is nothing wrong with having a large number of partitions. However, please keep in mind that the partition key value is of string type. What you may want to do is pad your order number with some character (say 0) so that all of your order numbers are the same length.
In this case, I would actually recommend that you keep the row key as empty.
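As a rough illustration of the padding idea (in Python for brevity, although the application in the question is Java): the key width of 10 digits and the helper name `order_partition_key` are assumptions made for this sketch, not part of any Azure SDK.

```python
KEY_WIDTH = 10  # assumed maximum number of digits in an order number


def order_partition_key(order_number: int) -> str:
    """Zero-pad the order number so that string order matches numeric order."""
    if order_number < 0 or len(str(order_number)) > KEY_WIDTH:
        raise ValueError("order number does not fit the assumed key width")
    return str(order_number).zfill(KEY_WIDTH)


# Unpadded strings compare character by character, so "10" sorts before "9".
unpadded = sorted(str(n) for n in (9, 10, 100))
padded = sorted(order_partition_key(n) for n in (9, 10, 100))

# Entity shape with the padded partition key and an empty row key.
entity = {"PartitionKey": order_partition_key(42), "RowKey": ""}
```

With fixed-width keys, range comparisons on the partition key behave the same way as comparisons on the underlying numbers.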
You may also want to think about storing multiple copies of the same data with different partition key/row key combination depending on your querying requirements. For example, if you were to query by order date, you may want to make another copy of the data with order date as the partition key.
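A sketch of what "multiple copies" could look like for the order data in the question. The property names, the `type_date` key layout and the 10-digit padding are all assumptions for illustration; the two dictionaries would be written to two separate (hypothetical) tables.

```python
from datetime import date


def entities_for_order(order_number: int, order_type: str,
                       order_date: date, details: str):
    """Build one entity per lookup pattern for the same order."""
    padded = str(order_number).zfill(10)  # assumed fixed key width
    payload = {
        "OrderType": order_type,
        "OrderDate": order_date.isoformat(),
        "OrderDetails": details,
    }
    # Copy 1: looked up by order number (point query).
    by_order = {"PartitionKey": padded, "RowKey": "", **payload}
    # Copy 2: looked up by order type + order date (partition query).
    by_type_date = {
        "PartitionKey": f"{order_type}_{order_date.isoformat()}",
        "RowKey": padded,  # keeps rows unique within the partition
        **payload,
    }
    return by_order, by_type_date


by_order, by_type_date = entities_for_order(42, "RETAIL", date(2020, 3, 1), "2 items")
```

Because the two copies live in different tables, they cannot be saved in a single batch transaction, so the writer has to handle the case where one write succeeds and the other fails.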
Generally speaking, it is recommended that you do point queries (queries that include both the partition key and the row key). The next best option is to query by partition key (you would want to keep the amount of data in a partition small so that you're not doing large partition scans). Any query that does not specify the partition key results in a full table scan, which is not recommended at all.
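To make the three cases concrete, here is a small sketch that only builds the OData filter strings the Table service accepts. The helper names are made up for this example, and `OrderType` is an assumed property name.

```python
def quote(value: str) -> str:
    """OData string literal; a single quote is escaped by doubling it."""
    return "'" + value.replace("'", "''") + "'"


def point_filter(partition_key: str, row_key: str) -> str:
    # Best case: both keys are given, so at most one entity matches.
    return f"PartitionKey eq {quote(partition_key)} and RowKey eq {quote(row_key)}"


def partition_filter(partition_key: str) -> str:
    # Next best: scans a single partition.
    return f"PartitionKey eq {quote(partition_key)}"


def property_filter(name: str, value: str) -> str:
    # No partition key: the service has to scan the whole table.
    return f"{name} eq {quote(value)}"


point = point_filter("0000000042", "")
partition = partition_filter("RETAIL_2020-03-01")
scan = property_filter("OrderType", "RETAIL")
```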
You may find this link useful: https://learn.microsoft.com/en-us/azure/storage/tables/table-storage-design-guidelines.

Related

Cassandra pagination and token function; selecting a partition key

I've been doing a lot of reading lately on Cassandra data modelling and best practices.
What escapes me is what the best practice is for choosing a partition key if I want an application to page through results via the token function.
My current problem is that I want to display 100 results per page in my application and be able to move on to the next 100 after.
From this post: https://stackoverflow.com/a/24953331/1224608
I was under the impression a partition key should be selected such that data spreads evenly across each node. That is, a partition key does not necessarily need to be unique.
However, if I'm using the token function to page through results, eg:
SELECT * FROM table WHERE token(partitionKey) > token('someKey') LIMIT 100;
That would mean that the number of results returned from my partition may not necessarily match the number of results I show on my page, since multiple rows may have the same token(partitionKey) value. Or worse, if the number of rows that share the partition key exceeds 100, I will miss results.
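The missed-rows problem can be reproduced with a small in-memory model. This is only a sketch: `zlib.crc32` stands in for Cassandra's partitioner hash, and `page_after` is a hypothetical emulation of the query above, not driver code.

```python
import zlib


def token(partition_key: str) -> int:
    # Stand-in for the partitioner hash (assumption: crc32, not Murmur3).
    return zlib.crc32(partition_key.encode("utf-8"))


def page_after(rows, last_key, limit):
    """Emulate: SELECT * FROM table WHERE token(pk) > token(last_key) LIMIT limit."""
    ordered = sorted(rows, key=lambda r: (token(r[0]), r[1]))
    if last_key is not None:
        ordered = [r for r in ordered if token(r[0]) > token(last_key)]
    return ordered[:limit]


# Three partitions with three rows each: (partition_key, clustering_value).
rows = [(pk, i) for pk in ("a", "b", "c") for i in range(3)]

seen = []
last_key = None
while True:
    page = page_after(rows, last_key, limit=2)
    if not page:
        break
    seen.extend(page)
    last_key = page[-1][0]  # partition key of the last row on the page

missed = sorted(set(rows) - set(seen))
```

Each page is cut after two rows, and the next query skips the rest of that partition, so the third row of every partition ends up in `missed`.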
The only way I could guarantee 100 results on every page (barring the last page) is if I were to make the partition key unique. I could then read the last value in my page and retrieve the next query with an almost identical query:
SELECT * FROM table WHERE token(partitionKey) > token('lastKeyOfCurrentPage') LIMIT 100;
But I'm not certain if it's good practice to have a unique partition key for a complex table.
Any help is greatly appreciated!
But I'm not certain if it's good practice to have a unique partition key for a complex table.
How you should choose your partition key depends on your requirements and data model. If the partition key is the entire primary key, it has to be unique; otherwise inserts with the same key will upsert (overwrite the existing data). If you have wide rows (a clustering key), then making the partition key unique (a key that appears only once in the table) defeats the purpose of wide rows. In CQL, “wide rows” just means that there can be more than one row per partition, but in that case there would be only one row per partition. It would help if you could provide the schema.
Please follow below link about pagination of Cassandra.
You do not need to use tokens if you are using Cassandra 2.0+.
Cassandra 2.0 has auto paging. Instead of using token function to
create paging, it is now a built-in feature.
Results pagination in Cassandra (CQL)
https://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/
Saving and reusing the paging state
You can use the pagingState object, which represents where you were in the result set when the last page was fetched.
EDITED:
Please check the below link:
Paging Resultsets in Cassandra with compound primary keys - Missing out on rows
I recently did a POC for a similar problem. Maybe adding this here quickly.
First there is a table with two fields. Just for illustration we use only few fields.
1.Say we insert a million rows with this
Along comes the product owner with a (rather strange) requirement that we need to list all the data as pages in the GUI. Assume that there are a hundred entries, shown as 10 pages of 10 entries each.
For this we update the table with a column called page_no.
Create a secondary index for this column.
Then do a one time update for this column with page numbers. Page number 10 will mean 10 contiguous rows updated with page_no as value 10.
Since we can query on a secondary index each page can be queried independently.
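The one-time page numbering step boils down to integer division over the row order. A minimal sketch, not taken from the linked repository; the page size of 10 and the function name are assumptions.

```python
PAGE_SIZE = 10  # rows per page, assumed from "10 contiguous rows"


def assign_page_numbers(row_ids, page_size=PAGE_SIZE):
    """Map each row id to a 1-based page number, in the order given."""
    return {row_id: index // page_size + 1
            for index, row_id in enumerate(row_ids)}


pages = assign_page_numbers(range(100))
page_10 = [row_id for row_id, page_no in pages.items() if page_no == 10]
```

Each `(row_id, page_no)` pair would then be written back to the `page_no` column, after which a page is fetched with a single equality query on the indexed column.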
Code is self explanatory and here - https://github.com/alexcpn/testgo
Note that cautions about using secondary indexes properly abound; please check them. In this use case I am hoping that I am using it properly. I have not tested with multiple clusters.
"In practice, this means indexing is most useful for returning tens,
maybe hundreds of results. Bear this in mind when you next consider
using a secondary index." From http://www.wentnet.com/blog/?p=77

What is the cardinality of a partition key?

If I use a randomly generated unique ID, is it correct that the cardinality would be rather large?
If I have a key with low cardinality, such as 5 category values that the partition key can take, and I want to distribute it, the recommended approach seems to be to make the partition key a composite key.
But this requires that I specify all the parts of the composite key in my query to retrieve all records for that key.
Even then, the generated tokens might end up on the same node.
Is there any way to choose the additional column for the composite key that would guarantee that the data is distributed?
The thing is that with Cassandra you actually want the partition keys to be "known" so that you can access the data when you need it. I'm not sure what you mean by large cardinality of the partition key. You would get a lot of partitions in the cluster, which is usually OK.
If you want to distribute the data around the cluster, you can use artificial columns. This approach is sometimes called bucketing. Basically, if you want to keep 100k+ (or, in newer versions, 1 million+) columns, it's OK to split this data into partitions.
Some people simply use a trick: when they insert the data, they add an artificial bucket column to the partition key, let's say random(1-10), and when they read the data back they issue 10 queries or use an IN operator, then fetch the data and merge it on the client side. This approach has many benefits, in that it prevents the appearance of "hot rows" in the cluster.
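A minimal sketch of the random bucket trick, modelling partition keys as plain tuples. The bucket count of 10 comes from the random(1-10) example; the function names are hypothetical.

```python
import random

NUM_BUCKETS = 10  # matches the random(1-10) example


def write_partition_key(category: str, rng=random):
    """Pick a random bucket at insert time: (category, bucket)."""
    return (category, rng.randint(1, NUM_BUCKETS))


def read_partition_keys(category: str):
    """At read time, every bucket has to be queried and merged client side."""
    return [(category, bucket) for bucket in range(1, NUM_BUCKETS + 1)]


key = write_partition_key("books", random.Random(0))
all_keys = read_partition_keys("books")
```

The list from `read_partition_keys` is what would feed either the 10 separate queries or the IN clause on the bucket column.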
The chance that any two keys end up on the same node is more or less 1/NUM_NODES. So I would say most of the time this is not something you should worry about too much, unless the number of partitions is smaller than the number of nodes in the cluster.
Basically there are two choices for the additional column: random (already described), or some function of the input data. For example, with time series data bucketed by month, you can always calculate the month from the data you are going to insert and put it in that bucket. When you retrieve the data, you know that you are looking for something in, say, May 2016, and so you know how to select the appropriate bucket.
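For the second choice, the bucket is a pure function of the row's own data, so writer and reader compute it the same way. A small sketch, assuming a `YYYY-MM` text bucket (the format is an assumption, any stable encoding of the month would do):

```python
from datetime import datetime


def month_bucket(ts: datetime) -> str:
    """Derive the bucket column value from the event timestamp."""
    return ts.strftime("%Y-%m")


# The writer computes the bucket from the row being inserted...
insert_bucket = month_bucket(datetime(2016, 5, 17, 8, 30))
# ...and a reader looking for "something in May 2016" computes the same value.
query_bucket = month_bucket(datetime(2016, 5, 1))
```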

Need recommendation on appropriate primary key structure

I have a lot of time series data that I would like to store in a Cassandra database. Since I can only do WHERE clauses on fields in the primary key, I need some recommendations on how to lay this out based on the way that I will need to query it.
My data is in this format:
SYSTEM_SERIAL_NUMBER,DEVICE_ID,TIMESTAMP,...OTHER COLUMNS
Each serial number has multiple devices, and I will have thousands of timestamps for every device, so my primary key to uniquely identify each set of data has to include all three.
There are basically two types of queries I will do on this data.
SELECT * FROM TABLE WHERE system_serial_number = 'X' and device_id = 'x' and timestamp (is in a range)
or
SELECT * FROM TABLE WHERE system_serial_number = 'X' and timestamp (is in a range)
The second one is the more likely query, because I am typically going to input a time range in the application and I want to see data from every single device for a given serial number. But I can't leave the device name out of the key because you need serial/device/timestamp to be able to uniquely identify an entire row.
I've tried to create my tables as follows:
CREATE TABLE devices (
system_serial_number text,
device_id int,
time_stamp timestamp,
...,
PRIMARY KEY ((system_serial_number,device_id),time_stamp)
);
And also as:
CREATE TABLE devices (
system_serial_number text,
device_id int,
time_stamp timestamp,
...,
PRIMARY KEY (system_serial_number,device_id,time_stamp)
);
The first one I think would keep me from hitting column limitations, but it always requires me to enter a Device ID along with the Serial every time I query. The second one is less column efficient (based on my understanding), and it allows me to search by serial only. Neither one of them lets me search by just serial/timestamp, which is actually the most common search that I am going to do, but isn't unique enough to be a primary key.
The only way I've even been able to get a query to work is by using the first one with the compound key and then adding a secondary index for just serial number, which then allows me to search by serial/timestamp, but I have to use the inefficient ALLOW FILTERING.
Any suggestions on the best way to get what I need?
The simplest answer is:
PRIMARY KEY (system_serial_number, time_stamp, device_id)
system_serial_number will be the partition key that identifies which replicas (nodes) will contain the data. All data for a single serial number will need to fit in the same partition. For efficient access, all queries will be required to specify a serial number. If partition size is a concern, there may be ways to further subdivide if the use case allows.
time_stamp will be the clustering key used to sort the rows within the partition. That is, all logical rows for the same serial number will be ordered by the timestamp, irrespective of the device. The first PK column that is not a part of the partition key determines the sort order.
device_id is an additional PK column to distinguish your logical rows, but does not help you sort or do other range scans.
Since you mentioned that each device would generate thousands of timestamps, and each serial number will have many devices, you may also need to be concerned about the size of your partitions if you take the above approach. A common approach is to break the data for a single serial number across multiple partitions, but that can make querying your data either more efficient or more troublesome, depending on how you decide to subdivide the data.
You will have to use some imagination and knowledge of your specific use cases to decide on the proper partitioning layout. Off the top of my head, I can think of some ideas:
PRIMARY KEY ((system_serial_number, device_hash_modulus), time_stamp, device_id)
Idea: hash your device IDs and apply a modulus to split the data across a fixed number of "buckets"
Advantage: with an even hash distribution, spreads data evenly across a known number of nodes
Disadvantage: querying across "all devices" for a given serial number requires making N queries, one for each "bucket" based on the number chosen for the modulo operation
Disadvantage: may need to adjust bucketing scheme (and migrate data) if initial choice is too small for eventual data size
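A sketch of how `device_hash_modulus` might be computed, assuming 8 buckets and MD5 as the hash (both arbitrary choices for illustration). A stable hash is used so that every writer and reader derives the same bucket for a given device.

```python
import hashlib

NUM_BUCKETS = 8  # assumed; changing it later means migrating data


def device_hash_modulus(device_id: int, buckets: int = NUM_BUCKETS) -> int:
    """Stable hash of the device id, reduced to a bucket number."""
    digest = hashlib.md5(str(device_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def partitions_for_serial(serial: str, buckets: int = NUM_BUCKETS):
    """All (serial, bucket) partition keys to query for 'all devices'."""
    return [(serial, bucket) for bucket in range(buckets)]


bucket = device_hash_modulus(1234)
partitions = partitions_for_serial("SN-001")
```

Reading every device for one serial number means one query per entry in `partitions`, which is the N-queries disadvantage listed above.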
PRIMARY KEY ((system_serial_number, coarse_time_stamp), time_stamp, device_id)
Idea: split the data over time into different partitions, size determined by how coarse you make the partitioning timestamp (year? year+month?, year+day?, etc.). The decision should be made based on how many unique records are expected within a given time period.
Advantage: assuming the cluster is configured with a random partitioner, the data will be evenly distributed around the cluster as time moves forward.
Disadvantage: querying for records across a range of time may involve making separate queries to different partitions, making the program logic more complex. If the partition timestamp isn't coarse enough, or the timestamp range to be searched is too wide, performance will be impacted.
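To give a feel for the extra program logic, here is a sketch that lists the partitions a time range touches, assuming `coarse_time_stamp` is a year+month string such as `2016-05` (one of the granularities suggested above; the exact format is an assumption).

```python
from datetime import datetime


def coarse_buckets(start: datetime, end: datetime):
    """Year+month buckets covering the inclusive range [start, end]."""
    if end < start:
        raise ValueError("end must not be before start")
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


buckets = coarse_buckets(datetime(2015, 11, 20), datetime(2016, 2, 3))
```

Each bucket in the result corresponds to one query against `(system_serial_number, coarse_time_stamp)`, with the original `time_stamp` range still applied inside every partition.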
There may be other options available to you, but it will all depend on how well you understand your current use cases (and how well you can predict the future behavior of your data set).

Windows Azure table access latency Partition keys and row keys selection

We've got a Windows Azure Table Storage system where various entity types report values during the day, so we've got the following partition and row key scenario:
There are about 4000 - 5000 entities. There are 6 entity types and the types are roughly evenly distributed, so around 800 each.
PartitionKey: entityType-Date
Row key: entityId
Each row records the values for an entity for that particular day. This is currently JSON serialized.
The data is quite verbose.
We will periodically want to look back at the values in these partitions over a month or two months depending on what our website users want to look at.
We are having a problem in that if we want to query a month of data for one entity we find that we have to query 31 partition keys by entityId.
This is very slow initially but after the first call the result is cached.
Unfortunately the nature of the site is that there will be a varying number of different queries so it's unlikely the data will benefit much from caching.
We could obviously make the partitions bigger i.e. perhaps a whole week of data and expand the rowKeys to entityId and date.
What other options are open to me, or is it simply the case that Windows Azure tables suffer from fairly high latency?
Some options include
Make the 31 queries in parallel
Make a single query on a partition key range, that is
Partition key >= entityType-StartDate and Partition key <= entityType-EndDate and Row key = entityId.
It is possible that depending on your data, this query may have less latency than your current query.
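Both options can be sketched as follows. This assumes the date part of the partition key is formatted as `yyyy-MM-dd` so that string comparison follows date order; `query_one` is a hypothetical callable that runs one point query against the table.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def partition_keys(entity_type: str, start: date, end: date):
    """One 'entityType-Date' partition key per day in [start, end]."""
    days = (end - start).days + 1
    return [f"{entity_type}-{(start + timedelta(days=i)).isoformat()}"
            for i in range(days)]


def range_filter(entity_type: str, start: date, end: date, entity_id: str) -> str:
    """Option 2: a single query over a partition key range."""
    return (f"PartitionKey ge '{entity_type}-{start.isoformat()}' "
            f"and PartitionKey le '{entity_type}-{end.isoformat()}' "
            f"and RowKey eq '{entity_id}'")


def fetch_all(keys, entity_id, query_one, workers=8):
    """Option 1: run the per-day point queries in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pk: query_one(pk, entity_id), keys))


keys = partition_keys("pump", date(2013, 1, 1), date(2013, 1, 31))
flt = range_filter("pump", date(2013, 1, 1), date(2013, 1, 31), "e1")
results = fetch_all(keys, "e1", lambda pk, eid: (pk, eid))  # stub query
```

The range filter only works if the keys sort lexicographically in date order, which is why the date format matters here.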

Azure - Querying 200 million entities

I have a need to query a store of 200 million entities in Windows Azure. Ideally, I would like to use the Table Service, rather than SQL Azure, for this task.
The use case is this: a POST containing a new entity will be incoming from a web-facing API. We must query about 200 million entities to determine whether or not we may accept the new entity.
With the entity limit of 1,000: does this apply to this type of query, i.e. I have to query 1,000 at a time and perform my comparisons / business rules, or can I query all 200 million entities in one shot? I think I would hit a timeout in the latter case.
Ideas?
Expanding on Shiraz's comment about Table storage: Tables are organized into partitions, and then your entities are indexed by a Row key. So, each row can be found extremely fast using the combination of partition key + row key. The trick is to choose the best possible partition key and row key for your particular application.
For your example above, where you're searching by telephone number, you can make TelephoneNumber the partition key. You could very easily find all rows related to that telephone number (though, not knowing your application, I don't know just how many rows you'd be expecting). To refine things further, you'd want to define a row key that you can index into, within the partition key. This would give you a very fast response to let you know whether a record exists.
Table storage (actually Azure Storage in general: tables, blobs, queues) has a well-known SLA. You can execute up to 500 transactions per second on a given partition. With the example above, the query for rows for a given telephone number would equate to one transaction (unless you exceed 1000 rows returned, in which case you'd need additional fetches to see all rows); adding a row key to narrow the search would, indeed, yield a single transaction. So would inserting a new row. You can also batch up multiple row inserts, within a single partition, and save them in a single transaction.
For a nice overview of Azure Table Storage, with some good labs, check out the Platform Training Kit.
For more info about transactions within tables, see this msdn blog post.
The limit of 1000 is the number of rows returned from a query, not the number of rows queried.
Pulling all of the 200 million rows into the web server to check them will not work.
The trick is to store the rows with a key that can be used to check if the record should be accepted.
