[Question posted by a user on YugabyteDB Community Slack]
Questions about row-level geolocation partitioning. Say I have two partitions in the USA and one in Singapore. If a user who normally accesses the data from the USA flies to Singapore, will that user's reads and writes be slower because most of the data lives in the USA? And if the user flies back to the USA and reads and updates rows that were created in Singapore, will that be slower?
Note that row-level geo-partitioning means you decide, on a per-row basis, where each row will be located.
In this case, if you want the rows to reside in the USA, querying them from Singapore will incur multi-region latency.
And if the user flies back to the USA and reads and updates rows that were created in Singapore, will that be slower?
Assuming the rows were being inserted in the USA all along, it will be fast, since the user and the data are on the same continent.
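For reference, row-level geo-partitioning in YugabyteDB is typically set up with region-pinned tablespaces plus list partitioning on a per-row region column. A rough sketch, where the table, column, and tablespace names are made up and the placement JSON is abbreviated to a single zone per region:

-- Tablespaces pinned to the two regions (placement abbreviated to one zone each).
CREATE TABLESPACE us_ts WITH (replica_placement=
  '{"num_replicas": 1, "placement_blocks":
    [{"cloud":"aws","region":"us-east-1","zone":"us-east-1a","min_num_replicas":1}]}');
CREATE TABLESPACE sg_ts WITH (replica_placement=
  '{"num_replicas": 1, "placement_blocks":
    [{"cloud":"aws","region":"ap-southeast-1","zone":"ap-southeast-1a","min_num_replicas":1}]}');

-- Parent table: the geo column decides, per row, where that row lives.
CREATE TABLE user_data (
  user_id bigint NOT NULL,
  geo     text   NOT NULL,
  payload text
) PARTITION BY LIST (geo);

-- One partition per region, each pinned to its tablespace.
CREATE TABLE user_data_us PARTITION OF user_data
  FOR VALUES IN ('us') TABLESPACE us_ts;
CREATE TABLE user_data_sg PARTITION OF user_data
  FOR VALUES IN ('sg') TABLESPACE sg_ts;

With this layout, a session in Singapore that reads or writes rows tagged geo = 'us' still pays the cross-region round trip, because those rows live in the US partition.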
Related
I am going to design a Cassandra cluster for the telecom domain with 7 nodes and a data volume of 30 TB at 45 days' retention. The application layer will generate a unique transaction id for each transaction, which is a combination of mobile number and date-time. A customer can ask for all details of a specific mobile number for a particular day or range of dates, for all transactions for a day, and from those details drill down to all details for a particular transaction id.
Would it be a good idea to create a single table with transaction id as the primary key and the other details as non-key columns? That may need 22*10^9 unique partitions. Is there any practical example of such a large number of partitions? A secondary index would be needed for the first two types of queries.
Would it be better to create different tables? One with a primary key of mobile number as the partition key and date as the clustering key, and another with transaction id as the primary key. Storage requirements would be higher.
Would a materialised view help here?
Please suggest any other ideas for best performance.
[Question posted by a user on YugabyteDB Community Slack]
I want someone to clear up my confusion; please correct me if I am wrong:
I have 3 nodes (3 tables)
Table structure:
ID (Hash of Account/Site/TS)
Account
Site
Timestamp
I have a pattern of accounts spread across multiple sites. Should I partition by account, or is it better to partition by site? (Is a small partition size better, or a large one?)
Reads filter on all three columns. Which is the better choice of partition key?
YugabyteDB doesn't need declarative partitioning to distribute data (that is done by sharding on the key). Partitioning is used to isolate data (for example cold data to archive later, or for geo-distribution placement).
If you define PRIMARY KEY ((Account, Site, Timestamp) HASH), you get the distribution (YugabyteDB uses a hash function on those columns to distribute rows to tablets) and no skew (because the timestamp is part of the hashed key). Sharding is automatic, so you don't have to define an additional column for it: https://docs.yugabyte.com/preview/architecture/docdb-sharding/sharding/
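For example, a minimal DDL along those lines, using the column names from the question (the table name, value column, and column types are assumptions; note that no separate hashed ID column is needed):

CREATE TABLE readings (
  account text,
  site    text,
  ts      timestamptz,
  value   double precision,
  -- All three key columns are hashed together, so rows spread evenly
  -- across tablets and the timestamp prevents hot partitions.
  PRIMARY KEY ((account, site, ts) HASH)
);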
I have read all the documents available on Microsoft's websites and the internet, but most of them talk about large data volumes, while my requirement is quite small.
I am trying to save customer onboarding data. Before a customer onboards, we assign them a company Id, a User Id, an admin role, and a default environment. The company can create multiple dummy environments for testing, e.g. Dev1, Stage, Test123, etc., and onboarding is done at the environment level.
Onboarding JSON
{
  "companyId": "Company123",
  "environment": "stg1",
  "userId": "User123",
  "startDate": 1212121212,
  "modifiedDate": 1212121212,
  "uniqueId": "<companyId_UserId>"
}
Onboarding can be done at the environment level. As per our data, a company can have at most 10 to 15 environments. In the above document, userId is just metadata recording which user started onboarding on environment stg1.
Initially I thought of using the company Id as the partition key, but in that case each logical partition will have at most 15 documents.
My Cosmos queries will filter on Company Id and Environment Id.
Is that a good approach? Or should I generate a synthetic partition key using a hash function and limit the number of logical partitions to 10 or 20?
Which one is faster?
A large number of logical partitions where each partition contains only 10 to 15 documents, or
A smaller number of logical partitions where each partition contains more documents?
My complete data size is less than 1 GB, so please don't assume that we will reach the "logical partition limit of 10 GB" here.
My other question is:
With the Azure SDK, inserting a new document costs 7.67 RU, but an upsert costs 10.9 RU. Is there any way to reduce this?
If your collection is never going to go over 20GB then what you use as a partition key is not as critical because all of your data (and your queries) will reside on a single physical partition. Partition keys (and partitioning) are all about scale (which is why we always talk about them in the context of large amounts of data or high volume of operations).
In a read-heavy workload, choosing a partition key that is used in all of your query where clauses is a safe strategy; in your case a synthetic key of environmentId-companyId is a good choice. If this is a write-heavy workload, then you also want the partition key values to distribute writes across partitions. But again, if this is a small collection then this matters little here.
Your id property is fine, as it lets you have the same companyId-userId value with different partition key values, which is what I assume you want. You can also do a point read with environmentId, companyId, and userId when you have all three, which you should prefer over queries when looking for a single item. Even though this collection will not grow, based upon what you say, the partition strategy here should allow it to scale should you ever want it to.
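As an illustration, assuming a synthetic partition key property named partitionKey holding environmentId-companyId values (the property name and sample values are not from your document), a query stays single-partition when it filters on that value:

SELECT *
FROM c
WHERE c.partitionKey = 'stg1-Company123'
  AND c.userId = 'User123'

When you also know the item's id, prefer a point read through the SDK over running a query like this.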
Upserts are always going to be more expensive than inserts because an upsert is two operations rather than one. The only way to reduce the cost of writes is to create a custom index policy and exclude paths you never query on. But based upon the example document in your post, a custom index policy will not get you any improvement.
Hope this is helpful.
The logical partition limit is not 20 GB, as far as I'm aware. As far as I know from talks with the product group developing Cosmos DB, there is no harm in creating as many partitions as you need; just keep in mind that you should avoid cross-partition queries at all costs (so design the data so that you never have to run them).
So a logical partition per customer makes sense, unless you want to do queries across all customers. But given the data set size it should not have a tremendous impact; either way, both approaches will work. I'd say creating a synthetic key is only needed when you cannot find a reasonable key without generating it.
I have 1000 partitions per table, with cust_id as the partition key and bucket_id and timestamp as the clustering keys.
Every hour, one bucket_id and timestamp entry is recorded per cust_id.
Each day, 24 * 1 = 24 rows will be recorded per partition.
Over one year that is approximately 9000 records per partition.
Partition size is approximately 4 MB.
The cluster is a 20-node Cassandra cluster, single DC, with RF=3.
I want to select five random buckets for the last 90 days of data using an IN query:
select cust_id,bucket_id,timestamp from customer_data where
cust_id='tlCXP5oB0cE2ryjgvvCyC52thm9Q11KJsEWe' and
bucket_id IN (0,2,5,7,8)
and timestamp >='2020-03-01 00:00:00' and
timestamp <='2020-06-01 00:00:00';
Please confirm: does this approach cause any issues with coordinator pressure and query timeouts?
How much data can a coordinator bear and return without any issues?
How (internally) does an IN query scan the records in Cassandra? Please provide a detailed explanation.
If I run the same kind of query for 10 million customers, does this affect coordinator pressure? Does it increase the chances of getting a read timeout error?
It could be hard to get a definitive yes/no answer to these questions - there are some unknowns in them. For example, what version of Cassandra is used, how much memory is allocated per instance, what disks are used for data, what compaction strategy is used for the table, what consistency level you use for reading the data, etc.
Overall, on recent versions of Cassandra and when using SSDs, I wouldn't expect problems with that until you have hundreds of items in the IN list, especially if you're using consistency level LOCAL_ONE and prepared queries - all drivers use a token-aware load balancing policy by default and will route the request to a node that holds the data, so it will be both coordinator and data node. Use of other consistency levels would put more pressure on the coordinating node, but it should still work quite well. Problems with read timeouts could start if you use HDDs or size the cluster incorrectly overall.
Regarding the 10 million customers - in your query you're selecting by partition key, so the query is usually sent directly to a replica (if you use prepared statements). To avoid problems you shouldn't use IN on the partition key column (cust_id in your case) - if you run the queries for individual customers, the driver will spread them over the whole cluster and you'll avoid increased pressure on coordinator nodes.
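In other words, keep the IN only on the clustering column and bind one customer per prepared statement, for example:

-- One partition per execution: a single bound cust_id, IN only on bucket_id.
SELECT cust_id, bucket_id, timestamp
FROM customer_data
WHERE cust_id = ?
  AND bucket_id IN (0, 2, 5, 7, 8)
  AND timestamp >= '2020-03-01 00:00:00'
  AND timestamp <= '2020-06-01 00:00:00';

The driver then routes each execution to a replica that owns that customer's partition, spreading the load across the cluster instead of piling it onto one coordinator.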
But as usual, you need to test your table schema and cluster setup to prove this. I would recommend using NoSQLBench - a benchmark/load testing tool that was recently open sourced by DataStax. It was built for quick load testing of clusters and checking data models, and incorporates a lot of knowledge in the area of performance testing.
Please try to ask one question per question.
Regarding how much a coordinator node can handle, Alex is correct in that there are several factors which contribute to that:
Size of the result set.
Heap/RAM available on the coordinator node.
Network consistency between nodes.
Storage config (spinning, SSD, NFS, etc).
Coordinator pressure will vary widely based on these parameters. My advice, is to leave all timeout threshold settings at their defaults. They are there to protect your nodes from becoming overwhelmed. Timeouts are Cassandra's way of helping you figure out how much it can handle.
How (internally) does an IN query scan the records on Cassandra? Please provide any detailed explanation.
Based on your description, the primary key definition should look like this:
PRIMARY KEY ((cust_id),bucket_id,timestamp)
The data will be stored on disk by partition and sorted by the clustering keys, similar to this (assuming ascending order on bucket_id and descending order on timestamp):
cust_id bucket_id timestamp
'tlCXP5oB0cE2ryjgvvCyC52thm9Q11KJsEWe' 0 2020-03-02 04:00:00
2020-03-01 22:00:00
1 2020-03-27 16:00:00
2 2020-04-22 05:00:00
2020-04-01 17:00:00
2020-03-05 22:00:00
3 2020-04-27 19:00:00
4 2020-03-27 17:00:00
5 2020-04-12 08:00:00
2020-04-01 12:00:00
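For reference, a table definition matching that layout (column types are assumptions) would look something like:

CREATE TABLE customer_data (
  cust_id   text,
  bucket_id int,
  timestamp timestamp,
  PRIMARY KEY ((cust_id), bucket_id, timestamp)
) WITH CLUSTERING ORDER BY (bucket_id ASC, timestamp DESC);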
Cassandra reads through the SSTable files in that order. It's important to remember that Cassandra reads sequentially off disk. When queries force it to perform random reads, that's when things may start to get a little slow. The read path has structures like partition offsets and bloom filters which help it figure out which files (and where inside them) have the data. But within a partition, it will need to scan clustering keys and figure out what to skip and what to return.
Depending on how many updates these rows have taken, it's important to remember that the requested data may stretch across multiple files. Reading one file is faster than reading more than one.
At the very least, you're forcing it to stay on one node by specifying the partition key. But you'll have to test how much a coordinator can return before causing problems. In general, I wouldn't specify double digits of items in an IN clause.
In terms of optimizing file access, Jon Haddad (now of Apple) has a great article on this: Apache Cassandra Performance Tuning - Compression with Mixed Workloads It focuses mainly on the table compression settings (namely chunk_length_in_kb) and has some great tips on how to improve data access performance. Specifically, the section "How Data is Read" is of particular interest:
We pull chunks out of SSTables, decompress them, and return them to the client....During the read path, the entire chunk must be read and decompressed. We’re not able to selectively read only the bytes we need. The impact of this is that if we are using 4K chunks, we can get away with only reading 4K off disk. If we use 256KB chunks, we have to read the entire 256K.
The point of this that is relevant to your question is that by skipping around (using IN), the coordinator will likely read data that it won't end up returning.
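If it helps, the chunk size the article discusses is a per-table compression option that can be changed with an ALTER TABLE; 4 KB below is just an illustrative value, and the new setting only applies to SSTables written after the change until existing ones are rewritten:

-- Smaller chunks mean less data read and decompressed per request,
-- at the cost of a larger compression offsets map in memory.
ALTER TABLE customer_data
  WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};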
I'd read many posts and articles comparing SQL Azure and the Table Service, and most of them said that the Table Service is more scalable than SQL Azure.
http://www.silverlight-travel.com/blog/2010/03/31/azure-table-storage-sql-azure/
http://www.intertech.com/Blog/post/Windows-Azure-Table-Storage-vs-Windows-SQL-Azure.aspx
Microsoft Azure Storage vs. Azure SQL Database
https://social.msdn.microsoft.com/Forums/en-US/windowsazure/thread/2fd79cf3-ebbb-48a2-be66-542e21c2bb4d
https://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
https://stackoverflow.com/questions/2711868/azure-performance
http://vermorel.com/journal/2009/9/17/table-storage-or-the-100x-cost-factor.html
Azure Tables or SQL Azure?
http://www.brentozar.com/archive/2010/01/sql-azure-frequently-asked-questions/
https://code.google.com/p/lokad-cloud/wiki/FatEntities
Sorry for the plain http links, I'm a new user >_<
But the http://azurescope.cloudapp.net/BenchmarkTestCases/ benchmarks show a different picture.
My case: using SQL Azure, one table with many inserts, about 172,000,000 per day (2,000 per second). Can I expect good performance for inserts and selects when I have 2 million records, or 9999....9 billion records, in one table?
Using the Table Service: one table with some number of partitions. The number of partitions can be large, very large.
Question #1: does the Table Service have any limitations or best practices for creating many, many, many partitions in one table?
Question #2: in a single partition I have a large number of small entities, like in the SQL Azure example above. Can I expect good performance for inserts and selects when I have 2 million records, or 9999 billion entities, in one partition?
I know about sharding and partitioning solutions, but this is a cloud service; isn't the cloud powerful enough to do all that work without my own code?
Question #3: Can anybody show me benchmarks for querying large amounts of data in SQL Azure and the Table Service?
Question #4: Maybe you could suggest a better solution for my case.
Short Answer
I haven't seen lots of partitions cause Azure Tables (AZT) problems, but I don't have this volume of data.
The more items in a partition, the slower queries in that partition will be
Sorry no, I don't have the benchmarks
See below
Long Answer
In your case I suspect that SQL Azure is not going to work for you, simply because of the limits on the size of a SQL Azure database. If each of the rows you're inserting is 1K including indexes, then at roughly 172 million rows per day you would be writing over 170GB a day and would blow past the 50GB limit almost immediately. It's true that Microsoft are talking about databases bigger than 50GB, but they've given no time frames on that. SQL Azure also has a throughput limit that I'm unable to find at this point (I'm pretty sure it's less than what you need though). You might be able to get around this by partitioning your data across more than one SQL Azure database.
The advantage SQL Azure does have though is the ability to run aggregate queries. In AZT you can't even write a select count(*) from customer without loading each customer.
AZT also has a limit of 500 transactions per second per partition, and a limit of "several thousand" per second per account.
I've found that choosing what to use for your partition key (PK) and row key (RK) depends on how you're going to query the data. If you want to access each of these items individually, simply give each row its own partition key and a constant row key. This will mean that you have lots of partitions.
For the sake of example, say the rows you were inserting were orders and the orders belonged to a customer. If it was more common for you to list orders by customer, you would have PK = CustomerId, RK = OrderId. This would mean that to find the orders for a customer you simply have to query on the partition key. To get a specific order you'd need to know both the CustomerId and the OrderId. The more orders a customer had, the slower finding any particular order would be.
If you just needed to access orders by OrderId, then you would use PK = OrderId, RK = string.Empty and put the CustomerId in another property. You can still write a query that brings back all orders for a customer, but because AZT doesn't support indexes other than on PartitionKey and RowKey, a query that doesn't use the PartitionKey (and sometimes even one that does, depending on how you write it) will cause a table scan. With the number of records you're talking about, that would be very bad.
In all of the scenarios I've encountered, having lots of partitions doesn't seem to worry AZT too much.
Another way you can partition your data in AZT that is not often mentioned is to put the data in different tables. For example, you might want to create one table for each day. If you want to run a query for last week, run the same query against the 7 different tables. If you're prepared to do a bit of work on the client end you can even run them in parallel.
Azure SQL can easily ingest that much data and more. Here's a video I recorded months ago showing a sample (available on GitHub) that demonstrates one way you can do that:
https://www.youtube.com/watch?v=vVrqa0H_rQA
here's the full repo
https://github.com/Azure-Samples/streaming-at-scale/tree/master/eventhubs-streamanalytics-azuresql