Large number of Cassandra partitions required

I am going to design a Cassandra cluster for the telecom domain with 7 nodes, a data volume of 30 TB, and 45-day retention. The application layer will generate a unique transaction ID for each transaction, which is a combination of mobile number and date-time. Customers can ask for all details for a specific mobile number for a particular day or range of dates, or for all transactions for a day; from those details, they can extract all details for a particular transaction ID.
Would it be a good idea to create a single table with transaction ID as the primary key and other details as non-key columns? It may need 22 × 10^9 unique partitions. Is there any practical example of such a large number of partitions? A secondary index would be needed for the first two types of queries.
Would it be better to create different tables: one with primary key (mobile number as partition key, date as clustering column) and another with transaction ID as primary key? Storage requirements would be higher. (A sketch of this two-table idea appears below.)
Would materialised views help here?
Kindly suggest any other idea for best performance.
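For concreteness, here is a minimal sketch of the two-table idea, assuming the DataStax Python driver; the keyspace, table, and column names are illustrative, not taken from a real schema:

from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("telecom")

# Table 1: partition by mobile number, cluster by transaction time.
# Serves "all transactions for a number on a day / range of dates"
# without needing a secondary index.
session.execute("""
    CREATE TABLE IF NOT EXISTS transactions_by_msisdn (
        mobile_number text,
        txn_time      timestamp,
        txn_id        text,
        details       text,
        PRIMARY KEY ((mobile_number), txn_time, txn_id)
    ) WITH CLUSTERING ORDER BY (txn_time DESC, txn_id ASC)
""")

# Table 2: partition by transaction id. Serves point lookups by id.
session.execute("""
    CREATE TABLE IF NOT EXISTS transactions_by_id (
        txn_id  text PRIMARY KEY,
        details text
    )
""")

# A day's transactions for one number is a single-partition slice query:
rows = session.execute(
    "SELECT * FROM transactions_by_msisdn "
    "WHERE mobile_number = %s AND txn_time >= %s AND txn_time < %s",
    ("919800000001", datetime(2024, 1, 1), datetime(2024, 1, 2)),
)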

How to select Partition Key in Azure Cosmos in case volume is very low (total records < 50k)

I have read the documents available on Microsoft's websites and elsewhere, but most of them talk about large data volumes, while my requirement is quite small.
I am trying to save customer onboarding data. Before a customer onboards, we assign a company ID, a user ID, an admin role, and a default environment. The company can create multiple dummy environments for testing (e.g. Dev1, Stage, Test123), and onboarding is done at the environment level.
Onboarding JSON
{
  "companyId": "Company123",
  "environment": "stg1",
  "userId": "User123",
  "startDate": 1212121212,
  "modifiedDate": 1212121212,
  "uniqueId": "<companyId_UserId>"
}
Onboarding happens at the environment level, and per our data a company can have at most 10 to 15 environments. In the document above, userId is just metadata recording which user started onboarding on environment stg1.
Initially I thought of using the company ID as the partition key, but in that case each logical partition will have at most 15 records.
My Cosmos queries will filter on company ID and environment ID.
Is that a good approach? Or should I generate a synthetic partition key using a hash function and limit the number of logical partitions to 10 or 20?
Which is faster:
a large number of logical partitions, each containing 10 to 15 documents, or
a smaller number of logical partitions, each containing more documents?
My complete data size is under 1 GB, so please don't assume we will hit the 10 GB logical partition limit here.
My other question: with the Azure SDK, inserting a new document costs 7.67 RU, but an upsert costs 10.9 RU. Is there any way to reduce this?
If your collection is never going to go over 20GB then what you use as a partition key is not as critical because all of your data (and your queries) will reside on a single physical partition. Partition keys (and partitioning) are all about scale (which is why we always talk about them in the context of large amounts of data or high volume of operations).
In a read-heavy workload, choosing a partition key that is used in all of your query WHERE clauses is a safe strategy; in your case a synthetic key of environmentId-companyId is a good choice. If this is a write-heavy workload, then you also want the partition key values to distribute writes across partitions. But again, if this is a small collection, this matters little here.
Your id property is fine, as it allows the same companyId-userId value with different partition key values, which is what I assume you want. You can also do a point read with environmentId, companyId, and userId if you have all three, which you should prefer over queries as much as possible when looking up a single item. Even though this collection will not grow, based on what you say, the partition strategy here should allow it to scale should you ever want it to. A sketch of such a point read follows.
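A sketch of the point read, assuming the azure-cosmos Python SDK; the account URL, database/container names, and the exact synthetic-key format are hypothetical:

from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/",
                      credential="<account-key>")
container = (client.get_database_client("onboarding-db")
                   .get_container_client("onboarding"))

company_id, environment, user_id = "Company123", "stg1", "User123"

# A point read needs the item id plus its partition key value and is
# the cheapest way to fetch a single known document.
item = container.read_item(
    item=f"{company_id}_{user_id}",               # uniqueId from the JSON above
    partition_key=f"{environment}-{company_id}",  # synthetic environmentId-companyId key
)
print(item["startDate"])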
Upserts are always going to be more expensive than an insert because it's two operations rather than one. The only way to reduce the cost of writes is to create a custom index policy and exclude paths you never query on. But based upon the example document in your post, a custom index policy will not get you any improvement.
Hope this is helpful.
The logical partition limit is not 20 GB, as far as I'm aware. As far as I know from talks with the product group developing Cosmos DB, there is no harm in creating as many partitions as you need; just keep in mind you should avoid cross-partition queries at all costs (so design the data in such a fashion that you never have to do cross-partition queries).
So a logical partition per customer makes sense, unless you want to run queries across all customers; but given the data set size it should not have a tremendous impact. Either way, both approaches will work. I'd say creating a synthetic key is only needed when you cannot find a reasonable key without generating one.

How to decide a good partition key for Azure Cosmos DB

I'm new to Azure Cosmos DB, but I want to have a clear understanding of:
What is the partition key?
My understanding is shallow for now: items with the same partition key go to the same partition for storage, which can improve load balancing as the system grows.
How to decide on a good partition key?
Could somebody please provide an example?
Thanks a lot!
You have to choose your partition key based on your workload. Workloads can be classified into two types:
Read Heavy
Write Heavy
Read-heavy workloads are those where data is read more often than it is written, like a product catalog: catalog entries are inserted and updated infrequently, while people browse products frequently.
Write-heavy workloads are those where data is written more often than it is read. A common scenario is IoT devices sending readings from multiple sensors; you will be writing lots of data to Cosmos DB because you may receive data every second.
For a read-heavy workload, choose a partition key property that is used in your filter queries. In the product example that would be the product ID, which is mostly used to fetch the data when a user reads the product information and browses its reviews.
For a write-heavy workload, choose a partition key property that is highly unique. For example, in the IoT scenario, use a partition key such as deviceid_signaldatetime, concatenating the ID of the device that sends the signal with the date-time of the signal for greater uniqueness. A small sketch follows.
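An illustrative sketch of the two strategies; all field and device names are hypothetical:

from datetime import datetime, timezone

# Read-heavy (product catalog): partition by the property your queries
# filter on, here the product id.
product = {"id": "prod-42", "partitionKey": "prod-42", "color": "red"}

# Write-heavy (IoT): concatenate device id and signal date-time so each
# write lands in a fresh, highly unique logical partition.
device_id = "device-17"
signal_dt = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
telemetry = {
    "id": f"{device_id}_{signal_dt}",
    "partitionKey": f"{device_id}_{signal_dt}",  # deviceid_signaldatetime
    "temperature": 21.5,
}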
1. What is the partition key?
In Azure Cosmos DB there are two kinds of partitions: physical and logical.
A. A physical partition is a fixed amount of reserved SSD-backed storage combined with a variable amount of compute resources.
B. A logical partition is a partition within a physical partition that stores all the data associated with a single partition key value.
I think the partition key you mentioned is the logical partition key. The partition key acts as a logical partition for your data and provides Azure Cosmos DB with a natural boundary for distributing data across physical partitions. For more details, refer to How does partitioning work.
2. How to decide on a good partition key? Could somebody please provide an example?
You should pick a property that has a wide range of values and even access patterns. An ideal partition key is one that appears frequently as a filter in your queries and has sufficient cardinality to ensure your solution is scalable.
For example, suppose your data has fields named id and color, and you query on color as a filter more frequently. Pick color, not id, as the partition key, which is more efficient for your query performance: every item has a different id, but many items share the same color, so color gives you a usable range of values. And if you add a new color, the partition key scales with it.
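A sketch of that color query, assuming the azure-cosmos Python SDK; the account URL and database/container names are hypothetical:

from azure.cosmos import CosmosClient

container = (CosmosClient("https://myaccount.documents.azure.com:443/",
                          credential="<account-key>")
             .get_database_client("mydb")
             .get_container_client("items"))

# With color as the partition key, this frequent query is served from a
# single logical partition instead of fanning out across all of them.
red_items = container.query_items(
    query="SELECT * FROM c WHERE c.color = @color",
    parameters=[{"name": "@color", "value": "red"}],
    partition_key="red",
)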
For more details, please read Partition and scale in Azure Cosmos DB.
Hope it helps you.

Creating unique partition keys can't be used afterwards for other data

I'm new to AWS DynamoDB. In my research I encountered this scenario: "Think of it like a bank with lines in front of teller windows. If everybody lines up at one teller, fewer customers can be served. It is more efficient to distribute customers across many different teller windows. A good partition key for distributing customers might be the customer number since it is different for each customer."
My question: how do I find the customer numbers handled by each teller using the same table (customer number as partition key)?
Think of it like a bank with lines in front of teller windows. If everybody lines up at one teller, fewer customers can be served. It is more efficient to distribute customers across many different teller windows. A good partition key for distributing customers might be the customer number since it is different for each customer.
What the sentence above explains is how DynamoDB distributes storage and how that affects querying. Think of each partition key value as a separate database server. When you have only a single partition key value, all queries go to that server, increasing its utilization and limiting throughput to that single server. With multiple partition key values, DynamoDB can internally find items in parallel across multiple servers without hitting a single-partition bottleneck.
Customer numbers are the partition key, and tellers are row data. So do I have to run the query on every customer number to find the data of a particular teller?
If you store the data in a DynamoDB table called Customers with only the customer number as the primary key, then to find a particular teller you need to scan the entire table, which is highly inefficient.
If you want to query the teller information directly, using only the teller ID, create a Global Secondary Index on the teller ID and query that index to find the teller information.
If your query provides both a customer number and a teller ID, you can re-create the table with the teller ID as a sort key (if that fits your data model) so that you can directly query the teller information for a particular customer. Both access paths are sketched below.
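A sketch of both access paths, assuming boto3; the table name, index name, and attribute names are hypothetical:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Customers")

# Path 1: teller id alone, via a Global Secondary Index on TellerId.
by_teller = table.query(
    IndexName="TellerIdIndex",
    KeyConditionExpression=Key("TellerId").eq("teller-7"),
)

# Path 2: customer number (partition key) plus teller id (sort key),
# if the table is re-created with TellerId as the sort key.
by_both = table.query(
    KeyConditionExpression=(
        Key("CustomerNumber").eq("cust-1001") & Key("TellerId").eq("teller-7")
    ),
)
print(by_teller["Items"], by_both["Items"])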

Cassandra partition keys organisation

I am trying to store the following structure in Cassandra:
ShopID, UserID, FirstName, LastName, etc.
Most of the queries on it are:
select * from table where ShopID = ? and UserID = ?
That's why it is useful to set (ShopID, UserID) as the primary key.
According to the docs, the default partition key chosen by Cassandra is the first column of the primary key, in my case ShopID. But I want to distribute the data uniformly across the Cassandra cluster, and I cannot allow all data for one ShopID to be stored in only one partition, because some shops have 10M records and some only 1k.
I can set up (ShopID, UserID) as a composite partition key and achieve uniform distribution of records across the cluster. But then I can no longer retrieve all users that belong to a given ShopID:
select *
from table
where ShopID = ?
Obviously this query would demand a full scan of the whole cluster, which I have no way to do. It looks like a very hard constraint.
My question is how to reorganize the data to solve both problems at the same time: uniform data partitioning and the ability to query all users of a shop.
In general you need to make the user ID a clustering column and add some artificial information to your table and partition key when saving. This breaks a large natural partition into multiple synthetic ones; but now you need to query all the synthetic partitions during reads to combine them back into the natural partition. So the goal is to find a reasonable trade-off between the number (and size) of synthetic partitions and the read queries needed to combine them, as in the sketch below.
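A minimal sketch of the bucketing idea, assuming the DataStax Python driver; the bucket count, keyspace, and table names are illustrative:

from cassandra.cluster import Cluster
from zlib import crc32

BUCKETS = 16  # trade-off: fewer buckets = bigger partitions,
              # more buckets = more read fan-out

session = Cluster(["127.0.0.1"]).connect("shop")

session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_shop (
        shop_id    text,
        bucket     int,
        user_id    text,
        first_name text,
        last_name  text,
        PRIMARY KEY ((shop_id, bucket), user_id)
    )
""")

def bucket_for(user_id: str) -> int:
    # Use a stable hash (not Python's randomized hash()) so the same
    # user always maps to the same synthetic partition.
    return crc32(user_id.encode()) % BUCKETS

# Writing: each row lands in one of 16 synthetic partitions per shop.
session.execute(
    "INSERT INTO users_by_shop (shop_id, bucket, user_id, first_name, last_name) "
    "VALUES (%s, %s, %s, %s, %s)",
    ("shop-1", bucket_for("user-42"), "user-42", "Ada", "Lovelace"),
)

# Reading the whole shop: fan out over all buckets and combine.
all_users = []
for b in range(BUCKETS):
    all_users.extend(session.execute(
        "SELECT * FROM users_by_shop WHERE shop_id = %s AND bucket = %s",
        ("shop-1", b),
    ))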
A comprehensive description of possible implementations can be found here and here (Example 2: User Groups).
Also take a look at the solution in Example 3 (User Groups by Join Date), where querying/ordering/grouping is performed on a clustering column of date type. It can be useful if you have similar queries.
Each node in Cassandra is responsible for some token ranges. Cassandra derives a token from a row's partition key by hashing and sends the record to the node whose token range includes that token. Different records can have the same token, and those records are grouped into partitions. For simplicity we can assume that each Cassandra node stores the same number of partitions, and we also want partitions to be roughly equal in size for uniform distribution between nodes. If we have one huge partition, one of our nodes needs more resources to process it; but if we break it into multiple smaller ones, we increase the chance that they will be evenly distributed across all nodes.
However, the distribution of token ranges between nodes is not related to the distribution of records between partitions. When we add a new node, it simply assumes responsibility for an even portion of the token ranges from the other nodes, and as a result an even share of the partitions. If we had 2 nodes with 3 GB of data each, after adding a third node each node stores 2 GB of data. That's why scalability isn't hurt by this partitioning and you don't need to rewrite your historical data after adding a new node.

Design of Partitioning for Azure Table Storage

I have some software which collects data over a long period of time, approximately 200 readings per second. It uses a SQL database for this. I am looking to move a lot of my old "archived" data into Azure.
The software uses a multi-tenant type architecture, so I am planning to use one Azure Table per Tenant. Each tenant is perhaps monitoring 10-20 different metrics, so I am planning to use the Metric ID (int) as the Partition Key.
Since each metric will only have one reading per minute (max), I am planning to use DateTime.Ticks.ToString("d19") as my RowKey.
However, I am lacking a little understanding of how this will scale, so I was hoping somebody might be able to clear this up:
For performance, Azure will/might split my table by PartitionKey in order to keep things quick, which in this case would result in one partition per metric.
However, my RowKey could potentially represent data over approximately 5 years, so I estimate approximately 2.5 million rows per partition.
Is Azure clever enough to then split based on rowkey as well, or am I designing in a future bottleneck? I know normally not to prematurely optimise, but with something like Azure that doesn't seem as sensible as normal!
Looking for an Azure expert to let me know if I am on the right line or whether I should be partitioning my data into more tables too.
A few comments:
Apart from storing the data, you may also want to look into how you would want to retrieve the data as that may change your design considerably. Some of the questions you might want to ask yourself:
When I retrieve the data, will I always be retrieving the data for a particular metric and for a date/time range?
Or do I need to retrieve the data for all metrics for a particular date/time range? If so, you're looking at a full table scan. Obviously you could avoid this by doing multiple queries (one query per PartitionKey).
Do I need to see the latest results first, or do I not really care? If it's the former, then your RowKey strategy should be something like (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19").
Also, since PartitionKey is a string value, you may want to convert the int value to a string with some "0" pre-padding so that all your IDs appear in order; otherwise you'll get 1, 10, 11, ..., 19, 2, etc. Both key formats are sketched below.
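A Python rendering of the key formats (the snippets quoted above are .NET): .NET ticks are 100 ns intervals since 0001-01-01, and DateTime.MaxValue.Ticks is 3155378975999999999.

from datetime import datetime, timezone

MAX_TICKS = 3_155_378_975_999_999_999  # DateTime.MaxValue.Ticks

def to_ticks(dt: datetime) -> int:
    delta = dt.replace(tzinfo=None) - datetime(1, 1, 1)
    return (delta.days * 864_000_000_000
            + delta.seconds * 10_000_000
            + delta.microseconds * 10)

now = datetime.now(timezone.utc)

# Equivalent of DateTime.Ticks.ToString("d19"): oldest rows sort first.
row_key_oldest_first = str(to_ticks(now)).zfill(19)

# Equivalent of (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks)
# .ToString("d19"): newest rows sort first.
row_key_newest_first = str(MAX_TICKS - to_ticks(now)).zfill(19)

# Zero-pad the int metric id so string ordering matches numeric order.
partition_key = str(42).zfill(4)  # "0042" sorts before "0100"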
To the best of my knowledge, Windows Azure partitions the data based on PartitionKey only, not the RowKey. Within a partition, the RowKey serves as a unique key. Windows Azure will try to keep data with the same PartitionKey on the same node, but since each node is a physical device (and thus has a size limitation), the data may flow to another node as well.
You may want to read this blog post from Windows Azure Storage Team: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx.
UPDATE
Based on your comments below and some information from above, let's try and do some math. This is based on the latest scalability targets published here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. The documentation states that:
Single Table Partition – a table partition is all of the entities in a table with the same partition key value, and usually tables have many partitions. The throughput target for a single table partition is up to 2,000 entities per second. Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning can process up to 20,000 entities/second, which is the overall account target described above.
Now, you mentioned that you have 10 to 20 different metric points, and for each metric point you'll write a maximum of 1 record per minute. That means you would be writing a maximum of 20 entities/minute/table, which is well under the scalability target of 2,000 entities/second.
That leaves the question of reading. Assume a user reads a maximum of 24 hours' worth of data (i.e. 24 * 60 = 1,440 points) per partition. If the user gets the data for all 20 metrics for 1 day, then each user (and thus each table) will fetch a maximum of 20 * 1,440 = 28,800 data points. The question left for you, I guess, is how many requests like this you can get per second while staying within that threshold. If you can extrapolate this information, I think you can reach some conclusion about the scalability of your architecture.
I would also recommend watching this video as well: http://channel9.msdn.com/Events/Build/2012/4-004.
Hope this helps.
