Azure - Querying 200 million entities - azure

I have a need to query a store of 200 million entities in Windows Azure. Ideally, I would like to use the Table Service, rather than SQL Azure, for this task.
The use case is this: a POST containing a new entity will be incoming from a web-facing API. We must query about 200 million entities to determine whether or not we may accept the new entity.
With the entity limit of 1,000: does this apply to this type of query, i.e. I have to query 1,000 at a time and perform my comparisons / business rules, or can I query all 200 million entities in one shot? I think I would hit a timeout in the latter case.
Ideas?

Expanding on Shiraz's comment about Table storage: Tables are organized into partitions, and then your entities are indexed by a Row key. So, each row can be found extremely fast using the combination of partition key + row key. The trick is to choose the best possible partition key and row key for your particular application.
For your example above, where you're searching by telephone number, you can make TelephoneNumber the partition key. You could very easily find all rows related to that telephone number (though, not knowing your application, I don't know just how many rows you'd be expecting). To refine things further, you'd want to define a row key that you can index into, within the partition key. This would give you a very fast response to let you know whether a record exists.
Table storage (actually Azure Storage in general - tables, blobs, queues) have a well-known SLA. You can execute up to 500 transactions per second on a given partition. With the example above, the query for rows for a given telephone number would equate to one transaction (unless you exceed 1000 rows returned - to see all rows, you'd need additional fetches); adding a row key to narrow the search would, indeed, yield a single transaction). So would inserting a new row. You can also batch up multiple row inserts, within a single partition, and save them in a single transaction.
For a nice overview of Azure Table Storage, with some good labs, check out the Platform Training Kit.
For more info about transactions within tables, see this msdn blog post.

The limit of 1000 is the number of rows returned from a query, not the number of rows queried.
Pulling all of the 200 million rows into the web server to check them will not work.
The trick is to store the rows with a key that can be used to check if the record should be accepted.

Related

Querying one record from tens of millions of records in Azure Table Storage

I have a typical scenario where a consumer is calling a Azure Function (EP1) (synchronously) which then queries Azure Table storage (having 5 million records), based upon the input parameters of the Azure Function API.
Azure Table Storage has following columns:
Order Number (incremental number)
IsConfirmed (can have value Y or N)
Type of Order (can be of 6 types maximum)
Order Date
Order Details
UUID
Now when consumer queries, it generally searches with the Order Number and expects the Order Date and Order Details in response, along with Order Number.
For this, we had chosen:
Partition Key: IsConfirmed + Type of Order
Row Key: UUID
Now for 5 million records search, because of the partition key type, the search partition often runs into more than 3 million records (maximum orders have IsConfirmed as Y and Type of Order a specific one among the six types) and the Table query takes more than 5 minutes.
As a result, the consumer generally times out as the wait configured on consumer side is 60 secs.
So looking for recommendation on how to do this efficiently.
Can we choose partition key as Order Number (but that will create 5 million partitions) or a combination of Order NUmber+IsConfirmed+TypeofOrder?
Ours is a write heavy Java application and READ happens much less.
+++++++++++ UPDATE +++++++++++++++
As suggested by Gaurav in the answer, after making orderid as partition key, the query is working as expected.
Now that brings to the next problem - we do have other API queries where the order data and type are only used as input search criteria.
Since this doesn't match with the partition key, so in this 2nd type of query, its basically making a whole scan and the consumer is again timed out again.
So what should be the design to handle these types of queries.. Azure doc says creating a separate table where order type + order date becomes partition key. However that will mean that whenever we are writing to the table, we will have to write on both tables (one with orderid as part key and other as order date + type as part key).
Can we choose partition key as Order Number (but that will create 5
million partitions) or a combination of Order
NUmber+IsConfirmed+TypeofOrder?
You can certainly choose partition key as order number as there is nothing wrong in having large number of partitions. However, please keep in mind that partition key value is of string type. What you may want to do is pad your order number with some character (say 0) so that all of your orders are of the same length.
In this case, I would actually recommend that you keep the row key as empty.
You may also want to think about storing multiple copies of the same data with different partition key/row key combination depending on your querying requirements. For example, if you were to query by order date, you may want to make another copy of the data with order date as the partition key.
Generally speaking it is recommended that you do point queries (query including both partition key and row key). Next best option would be to query by partition key (you would want to keep data in partition key small so that you're not doing partition scans). All other options would result in full table scan which is not at all recommended.
You may find this link useful: https://learn.microsoft.com/en-us/azure/storage/tables/table-storage-design-guidelines.

Cassandra - get all data for a certain time range

Is it possible to query a Cassandra database to get records for a certain range?
I have a table definition like this
CREATE TABLE domain(
domain_name text,
status int,
last_scanned_date long
PRIMARY KEY(text,last_scanned_date)
)
My requirement is to get all the domains which are not scanned in the last 24 hours. I wrote the following query, but this query is not efficient as Cassandra is trying to fetch entire dataset because of ALLOW FILTERING
SELECT * FROM domain where last_scanned_date<=<last24hourstimeinmillis> ALLOW FILTERING;
Then I decided to do it in two queries
1st query:
SELECT DISTINCT name from domain;
2nd query:
Use IN operator to query domains which are not scanned i nlast 24 hours
SELECT * FROM domain where
domain_name IN('domain1','domain2')
AND
last_scanned_date<=<last24hourstimeinmillis>
My second approach works, but comes with an extra overhead of querying first for distinct values.
Is there any better approach than this?
You should update your structure table definition. Currently, you are selecting domain name as your partition key while you can not have more than 2 billion records in single Cassandra partition.
I would suggest you should use your time as part of your partition key. If you are not going to receive more than 2 billion requests per day. Try to use day since epoch as the partition key. You can do composite partition keys but they won't be helpful for your query.
While querying you have to scan at max two partitions with an additional filter in a query or in your application filtering out results which do not belong to a
the range you have specified.
Go over following concepts before finalizing your design.
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCompositePartitionKeyConcept.html
https://docs.datastax.com/en/dse-planning/doc/planning/planningPartitionSize.html
Cassandra can effectively perform range queries only inside one partition. The same is for use of the aggregations, such as DISTINCT. So in your case you'll need to have only one partition that will contain all data. But that's is bad design.
You may try to split this big partition into smaller ones, by using TLDs as separate partition keys, and perform fetching in parallel from every partition - but this also will lead to imbalance, as some TLDs will have more sites than other.
Another issue with your schema is that you have last_scanned_date as clustering column, and this means that when you update last_scanned_date, you're effectively insert a new row into database - you'll need to explicitly remove row for previous last_scanned_date, otherwise the query last_scanned_date<=<last24hourstimeinmillis> will always fetch old rows that you already scanned.
Partially your problem with your current design could be solved by using the Spark that is able to perform effective scanning of full table via token range scan + range scan for every individual row - this will return only data in given time range. Or if you don't want to use Spark, you can perform token range scan in your code, something like this.

Cassandra pagination and token function; selecting a partition key

I've been doing a lot of reading lately on Cassandra data modelling and best practices.
What escapes me is what the best practice is for choosing a partition key if I want an application to page through results via the token function.
My current problem is that I want to display 100 results per page in my application and be able to move on to the next 100 after.
From this post: https://stackoverflow.com/a/24953331/1224608
I was under the impression a partition key should be selected such that data spreads evenly across each node. That is, a partition key does not necessarily need to be unique.
However, if I'm using the token function to page through results, eg:
SELECT * FROM table WHERE token(partitionKey) > token('someKey') LIMIT 100;
That would mean that the number of results returned from my partition may not necessarily match the number of results I show on my page, since multiple rows may have the same token(partitionKey) value. Or worse, if the number of rows that share the partition key exceeds 100, I will miss results.
The only way I could guarantee 100 results on every page (barring the last page) is if I were to make the partition key unique. I could then read the last value in my page and retrieve the next query with an almost identical query:
SELECT * FROM table WHERE token(partitionKey) > token('lastKeyOfCurrentPage') LIMIT 100;
But I'm not certain if it's good practice to have a unique partition key for a complex table.
Any help is greatly appreciated!
But I'm not certain if it's good practice to have a unique partition key for a complex table.
It depends on requirement and Data Model how you should choose your partition key. If you have one key as partition key it has to be unique otherwise data will be upsert (overridden with new data). If you have wide row (a clustering key), then make your partition key unique (a key that appears once in a table) will not serve the purpose of wide row. In CQL “wide rows” just means that there can be more than one row per partition. But here there will be one row per partition. It would be better if you can provide the schema.
Please follow below link about pagination of Cassandra.
You do not need to use tokens if you are using Cassandra 2.0+.
Cassandra 2.0 has auto paging. Instead of using token function to
create paging, it is now a built-in feature.
Results pagination in Cassandra (CQL)
https://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/
Saving and reusing the paging state
You can use pagingState object that represents where you are in the result set when the last page was fetched.
EDITED:
Please check the below link:
Paging Resultsets in Cassandra with compound primary keys - Missing out on rows
I recently did a POC for a similar problem. Maybe adding this here quickly.
First there is a table with two fields. Just for illustration we use only few fields.
1.Say we insert a million rows with this
Along comes the product owner with a (rather strange) requirement that we need to list all the data as pages in the GUI. Assuming that there are hundred entries 10 pages each.
For this we update the table with a column called page_no.
Create a secondary index for this column.
Then do a one time update for this column with page numbers. Page number 10 will mean 10 contiguous rows updated with page_no as value 10.
Since we can query on a secondary index each page can be queried independently.
Code is self explanatory and here - https://github.com/alexcpn/testgo
Note caution on how to use secondary index properly abound. Please check it. In this use case I am hoping that i am using it properly. Have not tested with multiple clusters.
"In practice, this means indexing is most useful for returning tens,
maybe hundreds of results. Bear this in mind when you next consider
using a secondary index." From http://www.wentnet.com/blog/?p=77

Windows Azure table access latency Partition keys and row keys selection

We've got a windows azure table storage system going on where we have various entity types that report values during the day so we've got the following partition and row key scenario:
There are about 4000 - 5000 entities. There are 6 entity types and the types are roughly evenly distributed. so around 800'ish each.
ParitionKey: entityType-Date
Row key: entityId
Each row records the values for an entity for that particular day. This is currently JSON serialized.
The data is quite verbose.
We will periodically want to look back at the values in these partitions over a month or two months depending on what our website users want to look at.
We are having a problem in that if we want to query a month of data for one entity we find that we have to query 31 partition keys by entityId.
This is very slow initially but after the first call the result is cached.
Unfortunately the nature of the site is that there will be a varying number of different queries so it's unlikely the data will benefit much from caching.
We could obviously make the partitions bigger i.e. perhaps a whole week of data and expand the rowKeys to entityId and date.
What other options are open to me, or is simply the case that Windows Azure tables suffer fairly high latency?
Some options include
Make the 31 queries in parallel
Make a single query on a partition key range, that is
Partition key >= entityType-StartDate and Partition key <= entityType-EndDate and Row key = entityId.
It is possible that depending on your data, this query may have less latency than your current query.

Azure Tables - Partition Key and Row Key - Correct Choice

I am new to Azure tables and having read a lot of articles but would like some reassurance on the above given its fundamental.
I have data which is similar to this:
CustomerId, GUID
TripId, GUID
JourneyStep, GUID
Time, DataTime
AverageSpeed, int
Based on what I have read, is CustomerId a good PartitionKey? Where I become stuck is the combination of CustomerId and TripId that does not make a unique row. My justification for TripId as the Row Key is because every query will be a dataset based on CustomerId and TripId.
Just for context, the CustomerId is clearly unique, the TripId represents one journey in a vehicle and within that journey the JourneyStep represents a unit within that Trip which may be 10 steps or 1000.
The intention is aggregate the data into further tables with each level being used for a different purpose. At the most aggregated level, the customer will be given some scores.
The amount of data will obviously be huge so need to think about query performance from the outset.
Updated:
As requested, the solution is for Vehicle Telematics so think of yourself in your own car. Blackbox shipping data to an server which in turn passes it to Azure Tables. In Relational DB terms, I would have a Customer Table and a trip table with a foreign key back to the customer table.
The tripId is auto generated by the blackbox. TripId does not need stored by date time from a query point of view, however may be relevant from a query performance point of view.
Queries will be split into two:
Display a map of a single journey for each customer, so filter by customer and then Trip to then iterate each row (journeystep) to a map.
Per customer, I will score each trip and then retrieve trips for, let's say, the last month to aggregate a score. I do have SQL Database to enrich data with client records etc but for the volume data (the trip data) I wish to use Azure Tables.
The aggregates from the second query will probably be stored in a separate table, so if someone made 10 trips in one month, I would run the second query which would score each trip, then produce a score for all trips that month and store both answers so potentially a table of trip aggregates and a table of monthly aggregates.
The thing about the Partition Key is that it represents a logical grouping; You cannot insert data spanning multiple partition keys, for example. Similarly, rows with the same partition are likely to be stored on the same server, making it quick to retrieve all the data for a given partition key.
As such, it is important to look at your domain and figure out what aggregate you are likely to work with.
If I understand your domain model correctly, I would actually be tempted to use the TripId as the Partition Key and the JourneyStep as the Row Key.
You will need to, separately, have a table that lists all the Trip IDs that belongs to a given Customer - which sort of makes sense as you probably want to store some data, such as "trip name" etc in such a table anyway.
Your design has to be related to your query. You can filter your data based on 2 columns PartitionKey and RowKey. PartitionKey is your most important column since your queries will hit that column first.
In your case CustomerId should be your PartitionKey since most of the time you will try to reach your data based on the customer. (you may also need to keep another table for your client list)
Now, RowKey can be your tripId or time. if I were you I probably use rowKey as yyyyMMddHHmm|tripId format which will let you to query based on startWith and endWidth options.
Adding to #Frans answer:
One thing you could do is create a separate table for each customer. So you could have table named like Customer. That way each customer's data is nicely segregated into different tables. Then you could use TripId as PartitionKey and then JourneyStep as RowKey as suggested by #Frans. For storing some metadata about the trip, instead of going into a separate table, I would still use the same table but here I would keep the RowKey as empty and put other information about the trip there.
I would suggest considering the following approach to your PK/RK design. I believe it would yield the best performance for your outlined queries:
PartitionKey: combination of CustomerId and TripId.
string.Format("{0}_{1}", customerId.ToString(), tripId.ToString())
RowKey: combination of the DateTime.MaxValue.Ticks - Time.Ticks formatted to a large 0-padded string with the JourneyStep.
string.Format("{0}_{1}", (DateTime.MaxValue.Ticks - Time.Ticks).ToString("00000000000000000"), JourneyStep.ToString())
Such combination will allow you to do the following queries "quickly".
Get data by CustomerId only. Example: context.Trips.Where(n=>string.Compare(id + "_00000000-0000-0000-0000-000000000000", n.PartitionKey) <= 0 && string.Compare(id+"_zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz") >=0).AsTableServiceQuery(context);
Get data by CustomerId and TripId. Example: context.Trips.Where(n=>n.PartitionKey == string.Format("{0}_{1}", customerId, tripId).AsTableServiceQuery(context);
Get last X amount of journey steps if you were to search by either CustomerId or CustomerId/TripId by using the "Take" function
Get data via date-range queries by translating timestamps into Ticks
Save data into a trip with a single storage transaction (assuming you have less than 100 steps)
If you can guarantee uniqueness of Times of Steps within each Trip, you don't even have to put JourneyStep into the RowKey as it is somewhat inconvenient
The only downside to this schema is not being able to retrieve a particular single journey step without knowing its Time and Id. However, unless you have very specific use cases, downloading all of the steps inside a trip and then picking a particular one from the list should not be so bad.
HTH
The design of table storage is a function to optimize two major capabilities of Azure Tables:
Scalability
Search performance
As #Frans user already pointed out, Azure tables uses the partitionkey to decide how to scale out your data on multiple storage server nodes. Because of this, I would advise against having unique partitionkeys, since in theory, you will have Azure spanning out storage nodes that will be able to serve one customer only. I say "in theory" because, in practice, Azure uses smart algorithms to identify if there are patterns in your partitionkeys and thus be able to group them (example, if your ids are consecutive numbers). You don't want to fall into this scenario because the scalability of your storage will be unpredictable and at the hands of obscure algorithms that will be making those decisions. See HERE for more information about scalability.
Regarding performance, the fastest way to search is to hit both partitionkey+rowkey in your search queries. Contrary to Amazon DynamoDB, Azure Tables does not support secondary column indexes. If you have your search queries search for attributes stored in columns apart from those two, Azure will need to do a full table scan.
I faced a situation similar to yours, where the design of the partition/row keys was not trivial. In the end, we expanded our data model to include more information so we could design our table in such a way that ~80% of all search queries can be matched to partition+row keys, while the remaining 20% require a table scan. We decided to include the user's location, so our partition key is the user's country and the rowkey is a customer unique ID. This means our data model had to be expanded to include the user's country, which was not a big deal. Maybe you can do the same thing? Group your customers by segment, or by location, or by email address SMTP domain?

Resources