Cosmos DB Graph Edge partitioning - azure

Cosmos DB has pre-announced general availability of the Gremlin (Graph) API. It will probably be out of preview by the end of 2017, so we might consider it stable enough for production. That brings me to the following:
We are designing a system with an estimated user base of up to 100 million users. Each user will have some documents in Cosmos to store user-related data; those documents are partitioned on the id of the user (a Guid). So if our estimates come true, we will end up with at least 100 million partitions, each containing a bunch of documents.
Not only will we store user-related data but also interrelated data (relationships) between users. On paper Cosmos should be very well suited for these kinds of scenarios, used cross-API with the Document API for normal data and the Graph API purely for the relationships.
An example of one of these relationships is a Follow. For instance UserX can Follow UserY. To realize this relationship, we created a Gremlin query that creates an Edge:
g.V().hasId('{userX.Id}').has('pkey','{userX.Partition}')
.addE('follow').to(g.V().hasId('{userY.Id}').has('pkey','{userY.Partition}'))
The resulting Edge automatically gets assigned to the partition of UserX, because UserX is the out-vertex.
When querying on outgoing edges (all the users that UserX is following), all is fine and well because the query is limited to the partition for UserX.
g.V().hasId('{userX.Id}').has('pkey','{userX.Partition}').outE('follow').inV()
However when inverting the query (find all followers of UserY), looking for incoming edges, the situation changes - to my knowledge this will result in a full cross-partition query:
g.V().hasId('{userY.Id}').has('pkey','{userY.Partition}').inE('follow').outV()
In my opinion a full cross-partition query with 100 million partitions is unacceptable.
I have tried putting the Edge between UserX and UserY inside its own partition, but the Graph API does not let me do this. (Edit: Changed Cosmos to Graph API)
Now I have come to the point of implementing a pair of edges between UserX and UserY, one outgoing Edge for UserX and one outgoing Edge for UserY, trying to keep them in-sync. All this in order to optimize the speed of my queries, but also introducing more work to achieve eventual consistency.
Then again I am wondering whether the Graph API is really up to these kinds of scenarios - or am I really missing something here?

I will start by clearing up a slight misconception you have regarding Cosmos DB partitioning. 100 million users doesn't mean 100 million partitions; it simply means 100 million partition keys. When you create a Cosmos DB graph it starts with 10 physical partitions (this is the starting default, which can be changed upon request), and then scales automatically as data grows.
In this case the 100 million users will be distributed among 10 physical partitions, so the full cross-partition query will hit 10 physical partitions. Also note that these partitions are hit in parallel, so the expected latency would be similar to hitting one partition, unless the operation is aggregate-like in nature.

This is a classic partitioning dilemma, not unique to Cosmos/Graph.
If your usage pattern is lots of queries with small scope then cross-partition is bad. If it is returning large data sets then the cross-partition overhead is probably insignificant against the benefits of parallelism. Unless you have a constant high volume of queries, I think the cross-partition overhead is overstated (MS seem to think everyone is building the next Facebook on Cosmos).
In the OP's case you can optimise for x follows y, or x is followed by y, or both, by having an edge each way, as sketched below. Note that RUs are reserved on a per-partition basis (i.e. total RU / number of partitions), so to use them efficiently you need either high-volume, evenly distributed, single-partition queries or queries that span multiple partitions.
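Here is a minimal sketch of that "edge each way" pattern using gremlinpython, mirroring the queries from the question. The connection details, the 'followedBy' label and the user_x.id / user_x.partition attributes are illustrative assumptions, not a prescribed schema:
# Two edges per follow: one in each user's partition, kept in sync by the app.
from gremlin_python.driver import client, serializer

gremlin = client.Client(
    "wss://<your-account>.gremlin.cosmos.azure.com:443/", "g",
    username="/dbs/<db>/colls/<graph>",
    password="<primary-key>",
    message_serializer=serializer.GraphSONSerializersV2d0())

def follow(user_x, user_y):
    # Edge 1 lives in UserX's partition: answers "who does X follow?"
    out_edge = (
        "g.V().hasId('{x}').has('pkey','{xp}')"
        ".addE('follow').to(g.V().hasId('{y}').has('pkey','{yp}'))"
    ).format(x=user_x.id, xp=user_x.partition, y=user_y.id, yp=user_y.partition)
    # Edge 2 lives in UserY's partition: answers "who follows Y?"
    in_edge = (
        "g.V().hasId('{y}').has('pkey','{yp}')"
        ".addE('followedBy').to(g.V().hasId('{x}').has('pkey','{xp}'))"
    ).format(x=user_x.id, xp=user_x.partition, y=user_y.id, yp=user_y.partition)
    # Two separate writes: the application must retry the second one if it
    # fails (the eventual consistency noted in the question).
    gremlin.submit(out_edge).all().result()
    gremlin.submit(in_edge).all().result()

# Both directions are now single-partition queries:
# g.V().hasId(xId).has('pkey', xPkey).outE('follow').inV()      -> users X follows
# g.V().hasId(yId).has('pkey', yPkey).outE('followedBy').inV()  -> Y's followers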

Related

How does Apache Cassandra perform on a single read of millions of records?

Much has been written about how Cassandra's redundancy provides good performance for thousands of incoming requests from different locations, but I haven't found anything on the throughput of a single big request. That's what this question is about.
I am assessing Apache Cassandra's potential as a database solution to the following problem:
The client would be a single-server application with exclusive access to the Cassandra database, co-located in the same datacentre. The Cassandra instance might be a few nodes, but likely not more than 5.
When a certain feature runs on the application (triggered occasionally by a human) it will populate Cassandra with up to 5M records representing short arrays of float data, as well as delete such records. The records will not be updated and we never need to access individual elements of an array. The arrays can be of different lengths, but will typically have around 100 elements, and each row might represent 0-20 arrays.
For example:
id array1 array2
123 [1.0, 2.5, ..., 10.8] [0.0, 0.5, ..., 1.0]
Bonus question: Should I use a list of doubles to represent this, or should I serialize the arrays to Json?
At some point the user requests a report and the server should read all 5M records, interpret the arrays, do some aggregation, and plot some data on the screen. Might the read operation take <1s, <10s, <100s? How can I estimate the throughput in this case, assuming it is the bottleneck?
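For concreteness, here is one way the native-list option from the bonus question could be modelled with the Python cassandra-driver; a native list<double> column spares the client from JSON (de)serialization, and with ~100 elements per array either option is viable. Keyspace, table and replication settings below are made up for illustration:
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])          # your co-located Cassandra nodes
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.float_arrays (
        id int PRIMARY KEY,
        array1 list<double>,     -- native collection: no JSON parsing needed
        array2 list<double>
    )
""")

insert = session.prepare(
    "INSERT INTO demo.float_arrays (id, array1, array2) VALUES (?, ?, ?)")
session.execute(insert, (123, [1.0, 2.5, 10.8], [0.0, 0.5, 1.0]))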
Let me start with your second use case. As your data is distributed across the nodes, if you run a broad range query without narrowing it down to a partition, Cassandra is going to perform slowly.
Cassandra is well suited for querying and searching if you know the partition key.
Even with 5M records, assuming they get scattered across 5 different nodes, for your reporting use case Cassandra has to go through all the nodes and aggregate the data, and eventually it times out.
This specific use case is not viable in Cassandra as a single query, but if you aggregate in your service and make multiple calls per partition and bucket, it is going to perform super fast; see the sketch below.
Generally, the access pattern matters and reads win: the data can be stored in any form, but reading it wisely is what matters to Cassandra.
That answers your second part. Thank you.
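A rough sketch of that fan-out, assuming the table is remodelled with a synthetic bucket column in the partition key (e.g. PRIMARY KEY ((bucket), id)) so the 5M rows split into a known number of buckets; all names here are illustrative:
from cassandra.cluster import Cluster

NUM_BUCKETS = 100

cluster = Cluster(["10.0.0.1"])
session = cluster.connect("demo")
stmt = session.prepare(
    "SELECT id, array1, array2 FROM float_arrays_by_bucket WHERE bucket = ?")

# Issue one async query per bucket so the driver spreads them across nodes,
# then aggregate in the application instead of asking Cassandra to do it.
futures = [session.execute_async(stmt, (b,)) for b in range(NUM_BUCKETS)]

total, count = 0.0, 0
for f in futures:
    for row in f.result():            # result() pages through the bucket's rows
        total += sum(row.array1)
        count += len(row.array1)

print("mean of array1 elements:", total / count if count else 0.0)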

Cosmos DB partition key and query design for sequential access

We would like to store a set of documents in Cosmos DB with a primary key of EventId. These records are evenly distributed across a number of customers. Clients need to access the latest records for a subset of customers as new documents are added. The documents are immutable, and need to be stored indefinitely.
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
If we use just CustomerId as the partition key, we would eventually run over the 10GB limit for a logical partition, and if we use EventId, then querying becomes inefficient (would result in a cross-partition query, and high RU usage, which we'd like to avoid).
Another idea would be to group documents into blocks. i.e. PartitionKey = int(EventId / PartitionSize). This would result in all clients hitting the latest partition(s), which presumably would result in poor performance and throttling.
If we use a combined PartitionKey of CustomerId and int(EventId / PartitionSize), then it's not clear to me how we would avoid a cross-partition query to retrieve the correct set of documents.
Edit:
Clarification of a couple of points:
Clients will access the events by specifying a list of CustomerId's, the last EventId they received, and a maximum number of records to retrieve.
For this reason, the use of EventId alone won't perform well, as it will result in a cross partition query (i.e. WHERE EventId > LastEventId).
The system will probably be writing on the order of 1GB a day, in 15 minute increments.
It's hard to know what the read volume will be, but I'd guess probably moderate, with maybe a few thousand clients polling the API at regular intervals.
So first things first: the logical partition size limit has now been increased to 20GB, please see here.
You can use EventId as a partition key as well; there is a size limit per logical partition, but no limit on the number of logical partitions. So using EventId is fine: you get a point read, which is very fast, if you query using the EventId. Now, you mention that this way you would have to do cross-partition queries - can you explain how?
A few things to keep in mind, though: Cosmos DB is not really meant for storing this kind of log-based data, as it stores everything on SSDs, so calculate the size of one document, how many you have to store per second, then how much that is per day and per month. You can use TTL to delete documents from Cosmos when you are done with them (sketched below), keep long-term storage in Azure Blob Storage, and for fast retrieval use Azure Search to query the data in Blob Storage using CustomerId and EventId in your search query.
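For reference, a minimal sketch of the TTL part with the azure-cosmos Python SDK; the account, names and the 30-day retention are placeholders, not recommendations:
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists("events-db")

# default_ttl applies to every item unless the item overrides it with its own "ttl".
container = db.create_container_if_not_exists(
    id="events",
    partition_key=PartitionKey(path="/customerId"),
    default_ttl=30 * 24 * 3600)   # expire documents after ~30 days

container.create_item({
    "id": "event-000123",
    "customerId": "cust1",
    "eventId": 123,
    "payload": "...",
    # Optionally override per item; -1 means "never expire" for this item.
    # "ttl": -1,
})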
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
I faced a similar issue some time back and a PartitionKey with customerId + datekey e.g. cust1_20200920 worked well for me.
I created the date key as 20200920 (YYYYMMDD), but you can choose to ignore the date part or even the month (cust1_202009 /cust1_2020), based on your query requirement.
Also, IMO, if there are multiple known PartitionKeys at query time, that's kind of a good thing. For example, if you keep YYYYMM in the PartitionKey and want to get data for 4 months, you can run 4 queries in parallel and combine the data, which is faster if you have many clients and these partition keys are distributed among multiple physical partitions. A sketch of this follows below.
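A rough sketch of that parallel fan-out with the azure-cosmos Python SDK, reusing the cust1_YYYYMM key format from above; container and field names are illustrative:
from concurrent.futures import ThreadPoolExecutor
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("events-db").get_container_client("events")

def query_month(pk):
    # Single-partition query: the partition key is passed explicitly,
    # so no cross-partition fan-out is needed.
    return list(container.query_items(
        query="SELECT * FROM c WHERE c.eventId > @lastEventId",
        parameters=[{"name": "@lastEventId", "value": 1000}],
        partition_key=pk))

months = ["cust1_202006", "cust1_202007", "cust1_202008", "cust1_202009"]
with ThreadPoolExecutor(max_workers=len(months)) as pool:
    results = [item for batch in pool.map(query_month, months) for item in batch]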
On a separate note, Cosmos Db has recently introduced an analytical store for the transactional data which can be useful for your use case.
More about it here - https://learn.microsoft.com/en-us/azure/cosmos-db/analytical-store-introduction
One approach is using multiple Cosmos containers as "hot/cold" tiers with different partitioning. We could use two containers:
Recent: all writes and all queries for recent items go here. Partitioned by CustomerId.
Archive: all items are copied here for long term storage and access. Partitioned by CustomerId + timespan (e.g. partition per calendar month)
The Recent container would provide single partition queries by customer. Data growth per partition would be limited either by setting reasonable TTL during creation, or using a separate maintenance job (perhaps Azure Function on timer) to delete items when they are no longer candidates for recent-item queries.
A Change Feed processor, implemented by an Azure Function or otherwise, would trigger on each creation in Recent and make a copy into Archive. This copy would have partition key combining the customer ID and date range as appropriate to limit the partition size.
This scheme should provide efficient recent-item queries from Recent and safe long-term storage in Archive, with reasonable Archive query efficiency given a desired date range. The main downside is two writes for each item (one for each container) -- but that's the tradeoff for efficient polling. Whether this tradeoff is worthwhile is probably best determined by simulating the load and observing performance.
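A minimal sketch of the Recent-to-Archive copy step; how the changed documents arrive (an Azure Functions Cosmos DB trigger, the change feed pull model, etc.) is deliberately left out, and the container names, the eventDate field and the CustomerId + month archive key are assumptions:
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("events-db")
recent = db.get_container_client("Recent")     # partitioned by /customerId, short TTL
archive = db.get_container_client("Archive")   # partitioned by /archiveKey

def copy_to_archive(changed_docs):
    # changed_docs: a batch of item dicts read from Recent's change feed.
    for doc in changed_docs:
        item = dict(doc)
        # Composite key bounds each Archive logical partition to one customer-month,
        # e.g. "cust1_202009" for an eventDate of "2020-09-20".
        item["archiveKey"] = "{}_{}".format(
            item["customerId"], item["eventDate"][:7].replace("-", ""))
        item.pop("ttl", None)          # archive items should not expire
        archive.upsert_item(item)      # upsert makes the copy idempotent on retries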

Azure Cosmos DB - Understanding Partition Key

I'm setting up our first Azure Cosmos DB - I will be importing into the first collection, the data from a table in one of our SQL Server databases. In setting up the collection, I'm having trouble understanding the meaning and the requirements around the partition key, which I specifically have to name while setting up this initial collection.
I've read the documentation here: (https://learn.microsoft.com/en-us/azure/cosmos-db/documentdb-partition-data) and still am unsure how to proceed with the naming convention of this partition key.
Can someone help me understand how I should be thinking in naming this partition key? See the screenshot below for the field I'm trying to fill in.
In case it helps, the table I'm importing consists of 7 columns, including a unique primary key, a column of unstructured text, a column of URL's and several other secondary identifiers for that record's URL. Not sure if any of that information has any bearing on how I should name my Partition Key.
EDIT: I've added a screenshot of several records from the table from which I'm importing, per request from #Porschiey.
Honestly the video here* was a MAJOR help to understanding partitioning in CosmosDb.
But, in a nutshell:
The PartitionKey is a property that will exist on every single object and is best used to group similar objects together.
Good examples include Location (like City), Customer Id, Team, and more. Naturally, it wildly depends on your solution; so perhaps if you were to post what your object looks like we could recommend a good partition key.
EDIT: Should be noted that PartitionKey isn't required for collections under 10GB. (thanks David Makogon)
* The video used to live on this MS docs page entitled, "Partitioning and horizontal scaling in Azure Cosmos DB", but has since been removed. A direct link has been provided, above.
Partition key acts as a logical partition.
Now, what is a logical partition, you may ask? A logical partition may vary based on your requirements; suppose you have data that can be categorized per customer, then the customer Id will act as the logical partition key and the info for the users will be placed according to their customer Id.
What effect does this have on the query?
While querying, you pass your partition key in the feed options and don't include it in your filter.
E.g. if your query was
SELECT * FROM T WHERE T.CustomerId = 'CustomerId'
it now becomes
var options = new FeedOptions { PartitionKey = new PartitionKey(CustomerId) };
var query = _client.CreateDocumentQuery(CollectionUri, "SELECT * FROM T", options).AsDocumentQuery();
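For reference, a rough equivalent with the newer azure-cosmos Python SDK might look like this (a sketch; account, database and collection names are placeholders):
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<db>").get_container_client("<collection>")

items = list(container.query_items(
    query="SELECT * FROM T",
    partition_key="<customerId>"))   # scopes the query to one logical partition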
I've put together a detailed article here: Azure Cosmos DB. Partitioning.
What's a logical partition?
Cosmos DB is designed to scale horizontally based on the distribution of data between physical partitions (PP) (think of a PP as a separately deployable, self-sufficient underlying node) and logical partitions (LP) - a bucket of documents sharing the same characteristic (partition key) which is supposed to be stored entirely on the same PP. So an LP can't have part of its data on PP1 and another part on PP2.
There are two main limitations on physical partitions:
Max throughput: 10k RUs
Max data size (the sum of the sizes of all LPs stored in this PP): 50GB
A logical partition has one limit - 20GB in size.
NOTE: Since the initial releases of Cosmos DB these size limits have grown, and I won't be surprised if they increase again soon.
How to select the right partition key for my container?
Based on the Microsoft recommendation, for maintainable data growth you should select a partition key with the highest cardinality (like the Id of the document or a composite field), for the main reason:
Spread request unit (RU) consumption and data storage evenly across all logical partitions. This ensures even RU consumption and storage distribution across your physical partitions.
It is critical to analyze the application's data consumption pattern when choosing the right partition key. In very rare scenarios larger partitions might work, though at the same time such solutions should implement data archiving to keep the DB size under control from the get-go (see the example below explaining why). Otherwise you should be ready for increasing operational costs just to maintain the same DB performance, plus potential PP data skew, unexpected "splits" and "hot" partitions.
Having a very granular, small-partition strategy will lead to some RU overhead (definitely not a multiplication of RUs, but rather a couple of additional RUs per request) when consuming data distributed across a number of physical partitions (PPs), but it is negligible compared to the issues that occur when data grows beyond 50-, 100-, 150GB.
Why large partitions are a terrible choice in most cases even though documentation says "select whatever works best for you"
The main reason is that Cosmos DB is designed to scale horizontally, and the provisioned throughput per PP is limited to [total provisioned per container (or DB)] / [number of PPs].
Once a PP split occurs due to exceeding the 50GB size, your max throughput for the existing PPs as well as for the two newly created PPs will be lower than it was before the split.
So imagine the following scenario (consider days as the measure of time between actions):
You've created a container provisioned with 10k RUs and CustomerId as the partition key (which generates one underlying physical partition, PP1). Maximum throughput per PP is 10k/1 = 10k RUs.
Gradually adding data to the container, you end up with 3 big customers: C1 [10GB], C2 [20GB] and C3 [10GB] of invoices.
When another customer C4 [15GB] is onboarded, Cosmos DB has to split PP1's data into two newly created partitions, PP2 (30GB) and PP3 (25GB). Maximum throughput per PP is 10k/2 = 5k RUs.
Two more customers, C5 [10GB] and C6 [15GB], are added and both end up in PP2, which leads to another split -> PP4 (20GB) and PP5 (35GB). Maximum throughput per PP is now 10k/3 = 3.333k RUs.
IMPORTANT: As a result, on [Day 2] C1's data could be queried with up to 10k RUs, but on [Day 4] with at most 3.333k RUs, which directly impacts the execution time of your queries.
This is the main thing to remember when designing partition keys in the current version of Cosmos DB (as of 12.03.21).
Cosmos DB can be used to store any amount of data. How it does that in the back end is by using the partition key. Is it the same as the primary key? - NO.
Primary key: uniquely identifies the data.
Partition key: helps in sharding the data (for example, one partition for the city New York when city is the partition key).
Partitions have a limit of 10GB, and the better we spread the data across partitions, the more of it we can use, though it will eventually need more connections to get data from all partitions. Example: getting data from the same partition in a query will always be faster than getting data from multiple partitions.
Partition Key is used for sharding, it acts as a logical partition for your data, and provides Cosmos DB with a natural boundary for distributing data across partitions.
You can read more about it here: https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
Each partition on a table can store up to 10GB (and a single table can store as many document schema types as you like). You have to choose your partition key though such that all the documents that get stored against that key (so fall into that partition) are under that 10GB limit.
I'm thinking about this too right now - so should the partition key be a date range of some type? In that case, it would really depend on how much data is getting stored in a period of time.
You are defining a logical partition.
Underneath, physically the data is split into physical partitions by Azure.
Ideally the partitionKey should be a primary key, or a field with high cardinality, to ensure proper distribution, with the self-generated id field within that partition also set to the primary key; that will make fetching a document by id much faster.
You cannot change a partitionKey once the container is created.
Looking at the dataset, captureId is a good candidate for the partitionKey, with id set manually to this field rather than to an auto-generated Cosmos one.
There is documentation available from Microsoft about partition keys. In my view, you need to check the queries or operations that you plan to perform with Cosmos DB. Are they read-heavy or write-heavy? If read-heavy, it is ideal to choose a partition key that will be used in the query's WHERE clause; if write-heavy, then look for a key with high cardinality.
Point reads/writes are always better, since they consume far fewer RUs than running other queries; see the sketch below.
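For illustration, a point read versus an equivalent query, sketched with the azure-cosmos Python SDK (account, container and field names are placeholders):
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<db>").get_container_client("<container>")

# Point read: id + partition key, the cheapest way to fetch a single document.
doc = container.read_item(item="order-001", partition_key="cust1")

# Equivalent query: returns the same document but costs more RUs.
docs = list(container.query_items(
    query="SELECT * FROM c WHERE c.id = @id",
    parameters=[{"name": "@id", "value": "order-001"}],
    partition_key="cust1"))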

Azure Table Storage Performance For a Very Specific Table

Is there a way around the 500 entities / second / partition limit with ATS (Azure Table Storage)? We are OK with dirty reads; if an insert is not immediately available for read, that's OK.
Looking to move some large tables from SQL to ATS.
Scale: because of these tables the size is bumping against the 150 GB limit of SQL Azure.
Insert speed: inverted index for query speed. Insert order is not sorted by the table's clustered index, which causes rapid SQL table fragmentation. ATS most likely has an insert advantage over SQL.
Cost: ATS has a lower monthly cost, but it has a higher load cost, as there are millions of rows and we cannot batch because the load order is not by partition.
Query speed: a search is almost never on just one partitionKey. A search will have a SQL component and zero or more ATS components. This ATS query is always by partitionKey and returns rowKeys. A raw search on partitionKey is fast; the problem is the time to return the entities (rows). A given partitionKey will have on average 1,000 rowKeys, which is 2 seconds at 500 entities / second / partition, but there will be some partitionKeys with over 100,000 rowKeys, which equates to over 3 minutes. SQL returns 10,000 rows at a time and no query takes over 10 seconds, because with the power of joins we don't have to bring down 100,000 rows to have those rows considered in the WHERE clause.
Is there a way around this entity retrieval speed with ATS? For scale and insert speed we would like to go to ATS.
Windows Azure Storage Abstractions and their Scalability Targets
How to get most out of Windows Azure Tables
Designing a Scalable Partitioning Strategy for Windows Azure Table Storage
Turn entity tracking off for query results that are not going to be modified:
context.MergeOption = MergeOption.NoTracking;
One potential workaround is to stripe the data across multiple partitions and/or tables, perform queries across all the (sub)partitions in parallel and merge the results.
For example, for striping across partitions, prepending the partition key with a single digit can multiply the scalability of the partition 10 times.
So a partition key, say ABCDEFGH, could be sub partitioned 0ABCDEFGH to 9ABCDEFGH.
Writes are made to a partition, with the prefix digit generated either randomly or in round robin fashion.
Reads would query across all 10 partitions in parallel and merge the results.
For striping across tables, one of N tables can be written to randomly or in round robin fashion and queried similarly in parallel.
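A rough sketch of this striping and fan-out using the azure-data-tables Python SDK; the 10 prefixes, table name and base partition key are illustrative:
import random
from concurrent.futures import ThreadPoolExecutor
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", table_name="Index")
BASE_PK = "ABCDEFGH"

def write(entity):
    # Spread writes over 10 sub-partitions: 0ABCDEFGH .. 9ABCDEFGH
    entity["PartitionKey"] = "{}{}".format(random.randint(0, 9), BASE_PK)
    table.create_entity(entity)

def read_all(row_filter=""):
    # Fan out one query per sub-partition and merge the results.
    def query(prefix):
        flt = "PartitionKey eq '{}{}'".format(prefix, BASE_PK)
        if row_filter:
            flt += " and " + row_filter
        return list(table.query_entities(flt))
    with ThreadPoolExecutor(max_workers=10) as pool:
        return [e for chunk in pool.map(query, range(10)) for e in chunk]

write({"RowKey": "row-001", "Value": 42})
entities = read_all()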
Edit: I had originally stated that the limit was 500 transaction/partition/sec. That was incorrect. The limit is actually 500 entities/partition/sec, as stated in the original question.
This also applies to the query speeds you've calculated. If you query an ATS PartitionKey and it returns 1000 entities, that will likely take only a little longer, perhaps a few hundred milliseconds, than returning a single entity. On the other hand, if the query returns more than 1000 entities it will be much slower, as each set of 1000 rows requires an essentially independent transaction and must be done in serial.
It's not completely clear to me what you're doing, but it sounds like a lot of querying. Keep in mind that querying ATS on non-key columns tends to be very slow. If you're doing a lot of that, you might be better served by using SQL Azure Federations and fan-out queries instead.

Azure Table Storage - How fast can I table scan?

Everyone warns not to query against anything other than RowKey or PartitionKey in Azure Table Storage (ATS), lest you be forced to table scan. For a while, this has paralyzed me into trying to come up with exactly the right PK and RK and creating pseudo-secondary indexes in other tables when I needed to query something else.
However, it occurs to me that I would commonly table scan in SQL Server when I thought appropriate.
So the question becomes, how fast can I table scan an Azure Table. Is this a constant in entities/second or does it depend on record size, etc. Are there some rules of thumb as to how many records is too many to table scan if you want a responsive application?
The issue of a table scan has to do with crossing the partition boundaries. The level of performance you are guaranteed is explicitly set at the partition level. Therefore, when you run a full table scan, it's a) not very efficient and b) doesn't have any guarantee of performance. This is because the partitions themselves are set on separate storage nodes, and when you run a cross-partition scan, you're consuming potentially massive amounts of resources (tying up multiple nodes simultaneously).
I believe, that the effect of crossing these boundaries also results in continuation tokens, which require additional round-trips to storage to retrieve the results. This results then in reducing performance, as well as an increase in transaction counts (and subsequently cost).
If the number of partitions/nodes you're crossing is fairly small, you likely won't notice any issues.
But please don't quote me on this. I'm not an expert on Azure Storage. It's actually the area of Azure I'm the least knowledgeable about. :P
I think Brent is 100% on the money, but if you still feel you want to try it, I can only suggest running some tests to find out for yourself. Try to include the partitionKey in your queries to prevent crossing partitions, because at the end of the day that's the performance killer.
Azure tables are not optimized for table scans. Scanning the table might be acceptable for a long-running background job, but I wouldn't do it when a quick response is needed. With a table of any reasonable size you will have to handle continuation tokens as the query reaches a partition boundary or obtains 1k results.
The Azure storage team has a great post which explains the scalability targets. The throughput target for a single table partition is 500 entities/sec. The overall target for a storage account is 5,000 transactions/sec.
The answer is pagination. Use top_size -- the maximum number of results or records per result set -- in conjunction with the continuation tokens next_partition_key and next_row_key. That makes a significant, even dramatic, difference in performance. For one, your results are statistically more likely to come from a single partition. Plain results show that result sets are grouped by the partition continuation key and not the row continuation key.
In other words, you also need to think about your UI or system output. Don't bother returning more than 10 to 20 results, 50 max. The user probably won't use or examine any more.
Sounds foolish? Do a Google search for "dog" and notice that the search returns only 10 items, no more. The next records are available if you bother to hit 'next'. Research has shown that almost no user ventures beyond that first page.
The select option (returning a subset of the key-values) may also make a difference; for example, use select = "PartitionKey,RowKey" or 'Name', whatever minimum you need.
"I believe, that the effect of crossing these boundaries also results
in continuation tokens, which require additional round-trips to
storage to retrieve the results. This results then in reducing
performance, as well as an increase in transaction counts (and
subsequently cost)."
...is slightly incorrect. the continuation token is used not because of crossing boundaries but because azure tables permit no more than 1000 results; therefore the two continuation tokens are used for the next set. default top_size is essentially 1000.
For your viewing pleasure, here's the description of query_entities from the Azure Python API. Others are much the same.
'''
Get entities in a table; includes the $filter and $select options.
table_name: Table to query.
filter:
Optional. Filter as described at
http://msdn.microsoft.com/en-us/library/windowsazure/dd894031.aspx
select: Optional. Property names to select from the entities.
top: Optional. Maximum number of entities to return.
next_partition_key:
Optional. When top is used, the next partition key is stored in
result.x_ms_continuation['NextPartitionKey']
next_row_key:
Optional. When top is used, the next row key is stored in
result.x_ms_continuation['NextRowKey']
'''
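The docstring above is from the older SDK; with the current azure-data-tables package the same top/select/continuation-token mechanics look roughly like this (a sketch, with an illustrative table and filter):
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", table_name="Index")

pages = table.query_entities(
    "PartitionKey eq 'ABCDEFGH'",
    select=["PartitionKey", "RowKey"],   # return only what the UI needs
    results_per_page=50,                 # analogous to top_size above
).by_page()

first_page = list(next(pages))           # show the first 50 results
token = pages.continuation_token         # holds NextPartitionKey / NextRowKey
# Pass the token back later to resume where the user left off:
# more = table.query_entities(..., results_per_page=50).by_page(continuation_token=token)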
