I have a question about a best practice when working with the Azure Table service.
Imagine a table called Customers, and several other tables split into a vast number of partitions. In these other tables, there are CustomerName fields.
When a Customer changes his name, I update the corresponding record in the Customers table. In contrast to a relational database, the CustomerName fields in the other tables are (obviously) not updated.
What is the best way to make sure that all the other tables are also updated? It seems extremely inefficient to me to query all tables on CustomerName and subsequently update all of those records.
If you are storing the CustomerName multiple times across tables, there is no magic to it: you will need to find those records and update the CustomerName field on each of them as well.
Since it is quite an inefficient operation, you can (and should) do this "off-transaction". Meaning, when you perform your initial "Name Change" operation, push an item onto a queue and have a worker perform the "Name Change" against the other tables. Since there is no web response or user waiting anxiously for the worker to complete, the fact that it is ridiculously inefficient is inconsequential.
This is a primary design pattern for implementing eventual consistency within distributed systems.
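For illustration, here is a minimal sketch of that hand-off in Python, assuming the azure-storage-queue SDK and a hypothetical queue named "name-changes" (the worker that drains the queue and rewrites the other tables is not shown):

import json
from azure.storage.queue import QueueClient

def enqueue_name_change(conn_str: str, customer_id: str, new_name: str) -> None:
    # Push the name change onto a queue so a background worker can update
    # every denormalized CustomerName field off-transaction.
    queue = QueueClient.from_connection_string(conn_str, "name-changes")  # queue assumed to exist
    queue.send_message(json.dumps({"CustomerId": customer_id, "NewName": new_name}))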
Related
I have to develop a project using a NoSql base, either couchbase or cassandra.
I would like to know if it is recommended to partition the data of each customer into a separate bucket?
In my case, there will never be requests between the different clients.
The data can be completely separated.
For Couchbase, I saw that a memory capacity is reserved for each bucket.
Or does the separation have to be done at another level, such as per document, or per super column for Cassandra?
Thank you
Or does the separation have to be done at another level, such as per document, or per super column for Cassandra?
Tip #1, when working with Cassandra, completely erase the word "super column" from your vocabulary.
I would like to know if it is recommended to partition the data of each customer into a separate bucket?
That depends. It sounds like your queries would be mostly based on a customer id, so it makes sense to have it as a part of your partition key. However, if each customer partition has millions of rows and/or columns underneath it, that's going to get very big.
Tip #2, proper Cassandra modeling is done based on what your required queries look like. So without actually seeing the kinds of queries you need to serve, it's going to be difficult to be any more specific than that.
If you have customer data relating to accounts and addresses, etc, then building a customers table with a PRIMARY KEY of only customer_id might make sense. But if you find that you need to query your customers (for example) by email_address, then you'll want to create a customers_by_email table, duplicate your data into that, and create a PRIMARY KEY that supports that.
Additionally, if you find yourself storing data on customer activity, you may want to consider a customer_activity table with a PRIMARY KEY of ((customer_id, month), activity_time). That will use both customer_id and month as a partition key, storing the customer's activity clustered by activity_time. In this case, if we didn't use month as an additional partition key, each customer_id partition would be continually written to, until it became too ungainly to write to or query (unbounded row growth).
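As an illustration only, here is roughly what those two tables could look like, created through the DataStax Python driver; the keyspace name and the non-key columns are assumptions:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")   # hypothetical keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS customers_by_email (
        email_address text,
        customer_id   uuid,
        name          text,
        PRIMARY KEY (email_address)
    )""")

session.execute("""
    CREATE TABLE IF NOT EXISTS customer_activity (
        customer_id   uuid,
        month         int,          -- e.g. 201601, bounds the partition size
        activity_time timestamp,
        activity      text,
        PRIMARY KEY ((customer_id, month), activity_time)
    ) WITH CLUSTERING ORDER BY (activity_time DESC)""")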
Summary:
If anyone tells you to use a super column in Cassandra, slap them.
You need to know your queries before you design your tables.
Yes, customer_id would be a good way to keep your data separate and ensure that each query is restricted to a single node.
Build your partition keys to account for unbounded row growth, to save you from writing too much data into the same partition.
Imagine something like a blog posting system, built using Azure Table Storage.
A user posts a message and the database records the user's Region, City and Language along with it.
After that, a user is able to browse all other users' posts and filter them by any combination of Region, City and Language, or by none of them and see all posts.
I see several solutions:
Put each message in 8 different partitions with combinations of Region-City-Language (pros: lightning fast point queries on read; cons: 8 transactions per message on write).
Put each message in 4 different partitions with combinations of Region-City and an ability to do partition scan to filter by languages (pros: less transactions than (1); cons: partition scan, 4 transactions per message).
Put each message in partitions, based on user's ID (pros: single transaction per message; cons: slow table scan and partition scan after that).
The way I see it:
Fast reads, slow (and perhaps costly) writes.
Balanced reads/writes/cost.
Fast writes, slow (but cheap) reads.
By "cost/cheap" I mean pricing based on transactions (not space).
And by "balanced" I mean just among these variants.
Thought about using index tables, but can't see how they help here.
So the question is: is there perhaps another, better way?
I've decided to go with a variation of (1).
The difference is that I won't be storing ALL combinations of Region-Location-Language. Instead I decided to store only the unique ones:
Table: FiltersByRegion
----------------------
Partition: Region
Row: Location.Language
Prop: Message
Table: FiltersByRegionPlace
---------------------------
Partition: Region.Location
Row: Language
Prop: Message
Table: FiltersByRegionLanguage
------------------------------
Partition: Region.Language
Row: Location
Prop: Message
Table: FiltersByLanguage
------------------------
Partition: Language
Row: Region.Location
Prop: Message
Because I'm storing only unique combinations, there won't be a lot of transactions for every post; only for those combinations that are not already present in the database.
In other words, if there are a lot of posts from the same region-location-language, the filter tables won't be updated and transactions won't be spent. The uniqueness checks could use Redis to speed things up a bit.
Filtering is now only a matter of picking the right table.
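A rough sketch of one of those filter tables with the azure-data-tables SDK, to show how the uniqueness check falls out of the insert itself; the Message property is whatever reference you decide to keep (a post id, for example):

from azure.data.tables import TableClient
from azure.core.exceptions import ResourceExistsError

def record_filter(conn_str: str, region: str, location: str, language: str, message: str) -> None:
    table = TableClient.from_connection_string(conn_str, "FiltersByRegion")
    entity = {
        "PartitionKey": region,
        "RowKey": f"{location}.{language}",
        "Message": message,
    }
    try:
        table.create_entity(entity)   # only costs a transaction for a new combination
    except ResourceExistsError:
        pass                          # combination already recorded, nothing to do

def posts_for_region(conn_str: str, region: str):
    return TableClient.from_connection_string(conn_str, "FiltersByRegion").query_entities(
        f"PartitionKey eq '{region}'")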
It depends on your scenarios and read/write pattern. You might want to consider some aspects:
Design for how the records will be queried. Putting them into a "Region-City-Language" partition with the message ID as entity data may help you achieve fast queries.
Give each message a unique message ID and save the ID-to-message mappings in a separate table; then, when a message is updated, you only need to update that one table, and the message ID referenced in the other tables stays unchanged (see the sketch below).
Leverage PartitionKey and RowKey in your table design and query entities with both keys. For instance: "Region-City-Language" as the partition key and "User" as the row key.
Consider storing duplicate copies of entities for query scenarios. For example, if you have heavy user-based and language-based queries, you may consider having two tables with "user" and "language" as keys respectively.
Please also refer to https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/ for a full guide.
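A rough sketch of the ID-mapping and both-keys tips above, assuming azure-data-tables; table names, key layout and properties are all illustrative:

from azure.data.tables import TableClient

conn_str = "UseDevelopmentStorage=true"    # placeholder / local emulator connection string

messages = TableClient.from_connection_string(conn_str, "Messages")
by_filter = TableClient.from_connection_string(conn_str, "PostsByRegionCityLanguage")

def post(message_id: str, user: str, region: str, city: str, language: str, body: str) -> None:
    # The message body lives in exactly one place, keyed by its id...
    messages.upsert_entity({"PartitionKey": message_id, "RowKey": "", "Body": body})
    # ...and only the id is duplicated into the query-optimized table, so an
    # edit to the body never touches this table.
    by_filter.upsert_entity({"PartitionKey": f"{region}-{city}-{language}",
                             "RowKey": f"{user}-{message_id}"})

def feed(region: str, city: str, language: str):
    # Single-partition query: both keys designed around how the data is read.
    return by_filter.query_entities(f"PartitionKey eq '{region}-{city}-{language}'")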
I am building an application and using Cassandra as my datastore. In the app, I need to track event counts per user, per event source, and need to query the counts for different windows of time. For example, some possible queries could be:
Get all events for user A for the last week.
Get all events for all users for yesterday where the event source is source S.
Get all events for the last month.
Low latency reads are my biggest concern here. From my research, the best way I can think of to implement this is a different counter table for each permutation of source, user, and predefined time window. For example, create a count_by_source_and_user table, where the partition key is a combination of source and user ID, and then create a count_by_user table for just the user counts.
This seems messy. What's the best way to do this, or could you point towards some good examples of modeling these types of problems in Cassandra?
You are right. If latency is your main concern, and it should be if you have already chosen Cassandra, you need to create a table for each of your queries. This is the recommended way to use Cassandra: optimize for reads and don't worry about redundant storage. And since within every table data is stored sequentially according to the primary key, you cannot index a table in more than one way (as you would with a relational DB). I hope this helps. Look for the "Data Modeling" presentation that is usually given at "Cassandra Day" events. You may find it on "Planet Cassandra" or John Haddad's blog.
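For what it's worth, here is a sketch of what "a table per query" could look like for these counts, using the DataStax Python driver; the keyspace, the day bucketing and the column names are my assumptions, not a prescription:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("events")   # hypothetical keyspace

# Counts per (source, user), bucketed by day.
session.execute("""
    CREATE TABLE IF NOT EXISTS count_by_source_and_user (
        source  text,
        user_id text,
        day     date,
        hits    counter,
        PRIMARY KEY ((source, user_id), day)
    )""")

# Counts per user only, duplicated on write to serve the per-user query.
session.execute("""
    CREATE TABLE IF NOT EXISTS count_by_user (
        user_id text,
        day     date,
        hits    counter,
        PRIMARY KEY (user_id, day)
    )""")

def record_event(source: str, user_id: str, day) -> None:
    # Counter updates cannot be batched with regular writes; issue one per table.
    session.execute(
        "UPDATE count_by_source_and_user SET hits = hits + 1 "
        "WHERE source = %s AND user_id = %s AND day = %s",
        (source, user_id, day))
    session.execute(
        "UPDATE count_by_user SET hits = hits + 1 WHERE user_id = %s AND day = %s",
        (user_id, day))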
We are about to implement the Read portion of our CQRS system in-house with the goal being to vastly improve our read performance. Currently our reads are conducted through a web service which runs a Linq-to-SQL query against normalised data, involving some degree of deserialization from an SQL Azure database.
The simplified structure of our data is:
User
Conversation (Grouping of Messages to the same recipients)
Message
Recipients (Set of Users)
I want to move this into a denormalized state, so that when a user requests to see a feed of messages it reads from EITHER:
A denormalized representation held in Azure Table Storage
UserID as the PartitionKey
ConversationID as the RowKey
Any volatile data prone to change stored as entities
The messages serialized as JSON in an entity
The recipients of said messages serialized as JSON in an entity
The main problem with this is the limited size of a row in Table Storage (960KB)
Also any queries on the "volatile data" columns will be slow as they aren't part of the key
A normalized representation held in Azure Table Storage
Different table for Conversation details, Messages and Recipients
Partition keys for message and recipients stored on the Conversation table.
Bar that, this follows the same structure as above
Gets around the maximum row size issue
But will the normalized state reduce the performance gains of a denormalized table?
OR
A denormalized representation held in SQL Azure
UserID & ConversationID held as a composite primary key
Any volatile data prone to change stored in separate columns
The messages serialized as JSON in a column
The recipients of said messages serialized as JSON in a column
Greatest flexibility for indexing and the structure of the denormalized data
Much slower performance than Table Storage queries
What I'm asking is whether anyone has any experience implementing a denormalized structure in Table Storage or SQL Azure, which would you choose? Or is there a better approach I've missed?
My gut says the normalized (At least to some extent) data in Table Storage would be the way to go; however I am worried it will reduce the performance gains to conduct 3 queries in order to grab all the data for a user.
Your primary driver for considering Azure Tables is to vastly improve read performance, and in your scenario using SQL Azure is "much slower" according to your last point under "A denormalized representation held in SQL Azure". I personally find this very surprising for a few reasons and would ask for detailed analysis on how this claim was made. My default position would be that under most instances, SQL Azure would be much faster.
Here are some reasons for my skepticism of the claim:
SQL Azure uses the native/efficient TDS protocol to return data; Azure Tables use JSON format, which is more verbose
Joins / Filters in SQL Azure will be very fast as long as you are using primary keys or have indexes in SQL Azure; Azure Tables do not have indexes and joins must be performed client side
Limitations in the number of records returned by Azure Tables (1,000 records at a time) means you need to implement multiple roundtrips to fetch many records
Although you can fake indexes in Azure Tables by creating additional tables that hold a custom-built index, you own the responsibility of maintaining that index, which will slow your operations and possibly create orphan scenarios if you are not careful.
Last but not least, using Azure Tables usually makes sense when you are trying to reduce your storage costs (it is cheaper than SQL Azure) and when you need more storage than what SQL Azure can offer (although you can now use Federations to break the single database maximum storage limitation). For example, if you need to store 1 billion customer records, using Azure Tables may make sense. But using Azure Tables for increased speed alone is rather suspicious in my mind.
If I were in your shoes I would question that claim very hard and make sure you have expert SQL development skills on staff who can demonstrate that you are reaching performance bottlenecks inherent to SQL Server/SQL Azure before changing your architecture entirely.
In addition, I would define what your performance objectives are. Are you looking at 100x faster access times? Did you consider caching instead? Are you using indexing properly in your database?
My 2 cents... :)
I won't try to argue the exact definition of CQRS. As we are talking about Azure, I'll use its docs as a reference. From there we can find that:
CQRS doesn't necessarily require that you use a separate read store.
For greater isolation, you can physically separate the read data from the write data.
"you can" doesn't mean "you must".
About denormalization and read optimization:
Although
The read model of a CQRS-based system provides materialized views of the data, typically as highly denormalized views
the key point is
the read database can use its own data schema that is optimized for queries
It can be a different schema, but it can still be normalized or at least not "highly denormalized". Again - you can, but that doesn't mean you must.
More than that, if your performance is poor due to write locks and not because of heavy SQL requests:
The read store can be a read-only replica of the write store
And when we talk about optimizing requests, it's better to talk more about the requests themselves, and less about storage types.
About "it reads from either" [...]
The Materialized View pattern describes generating prepopulated views of data in environments where the source data isn't in a suitable format for querying, where generating a suitable query is difficult, or where query performance is poor due to the nature of the data or the data store.
Here the key point is that views are plural.
A materialized view can even be optimized for just a single query.
...
Materialized views tend to be specifically tailored to one, or a small number of queries
So your choice is not between those 3 options. It's much wider, actually.
And again, you don't need another storage to create views. All can be done inside a single DB.
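As one concrete (and purely illustrative) example of a view living inside the same database, here is an indexed view in SQL Azure created through pyodbc; the table and column names are hypothetical:

import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 18 for SQL Server};SERVER=...;DATABASE=...")  # placeholder
cur = conn.cursor()

# Schema-bound view over hypothetical Conversations/Messages tables.
cur.execute("""
    CREATE VIEW dbo.UserFeed WITH SCHEMABINDING AS
    SELECT c.UserId, c.ConversationId, COUNT_BIG(*) AS MessageCount
    FROM dbo.Conversations AS c
    JOIN dbo.Messages AS m ON m.ConversationId = c.ConversationId
    GROUP BY c.UserId, c.ConversationId
""")
# The unique clustered index is what actually materializes the view.
cur.execute("""
    CREATE UNIQUE CLUSTERED INDEX IX_UserFeed
    ON dbo.UserFeed (UserId, ConversationId)
""")
conn.commit()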
About
My gut says the normalized (At least to some extent) data in Table Storage would be the way to go; however I am worried it will reduce the performance gains to conduct 3 queries in order to grab all the data for a user.
Yes, of course, performance will suffer! (Also consider the matter of consistency.) But whether it will be OK or not, you can never be sure until you test it. With your data and your requests. Because delays in data transfers can actually be less than the time required for some elaborate SQL request.
So all boils down to:
What features do you need and which of them Table Storage and/or SQL Azure have?
And then, how much will it cost?
These you can only answer yourself. And these choices have little to do with performance. Because if there is a suitable index in either of those, I believe the performance will be virtually indistinguishable.
To sum up:
SQL Azure or Azure Table Storage?
For different requests and data you can, and you probably should, use both. But there is too little information in the question to give you the exact answer (we need an exact request for that). But I agree with @HerveRoggero - most probably you should stick with SQL Azure.
I am not sure if I can add any value to the other answers, but I want to draw your attention toward modeling the data storage based on your query paths. Are you going to query all the mentioned data bits together? Is the user going to ask for some of it as additional information after a click or something? I am assuming that you have thought about this question already and that you are positive you want to query everything in one go, i.e., the API or something needs to return all this information at once.
In that case, nothing will beat querying a single object by key. If you are talking about Azure's Table Storage specifically, it says right there that it's a key-value store. I am curious whether you have considered a document database (e.g. Cosmos DB) instead? If you are implementing CQRS read models, you could generate a single document per user that has all the information a user sees on a feed. You query that document by user id, which would be the key. This approach would be the optimal CQRS implementation in my mind because, after all, you are aiming to implement read models. Unless I misinterpreted something in your question or you have strong reasons not to go with document databases.
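A minimal sketch of that idea with the azure-cosmos SDK, assuming a container partitioned by a userId field; database, container and field names are placeholders:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("readmodels").get_container_client("user_feeds")

def save_feed(user_id: str, conversations: list) -> None:
    # The projection/denormalization happens when the write side changes;
    # here we just upsert the whole precomputed feed document.
    container.upsert_item({
        "id": user_id,              # document key
        "userId": user_id,          # also the container's partition key
        "conversations": conversations,
    })

def get_feed(user_id: str) -> dict:
    # Point read by id + partition key: the cheapest lookup Cosmos DB offers.
    return container.read_item(item=user_id, partition_key=user_id)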
I've read many posts and articles comparing SQL Azure and the Table service, and most of them say that the Table service is more scalable than SQL Azure.
http://www.silverlight-travel.com/blog/2010/03/31/azure-table-storage-sql-azure/
http://www.intertech.com/Blog/post/Windows-Azure-Table-Storage-vs-Windows-SQL-Azure.aspx
Microsoft Azure Storage vs. Azure SQL Database
https://social.msdn.microsoft.com/Forums/en-US/windowsazure/thread/2fd79cf3-ebbb-48a2-be66-542e21c2bb4d
https://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
https://stackoverflow.com/questions/2711868/azure-performance
http://vermorel.com/journal/2009/9/17/table-storage-or-the-100x-cost-factor.html
Azure Tables or SQL Azure?
http://www.brentozar.com/archive/2010/01/sql-azure-frequently-asked-questions/
https://code.google.com/p/lokad-cloud/wiki/FatEntities
Sorry for the plain http links, I'm a new user >_<
But the http://azurescope.cloudapp.net/BenchmarkTestCases/ benchmark shows a different picture.
My case: using SQL Azure, one table with many inserts, about 172,000,000 per day (2,000 per second). Can I expect good performance for inserts and selects when I have 2 million records or 9999....9 billion records in one table?
Using the Table service: one table with some number of partitions. The number of partitions can be large, very large.
Question #1: does the Table service have any limitations or best practices for creating many, many, many partitions in one table?
Question #2: in a single partition I have a large number of small entities, like in the SQL Azure example above. Can I expect good performance for inserts and selects when I have 2 million records or 9999 billion entities in one partition?
I know about sharding and partitioning solutions, but this is a cloud service; isn't the cloud powerful enough to do all that work without my own code?
Question #3: can anybody show me benchmarks for querying large amounts of data in SQL Azure and the Table service?
Question #4: maybe you could suggest a better solution for my case?
Short Answer
I haven't seen lots of partitions cause Azure Tables (AZT) problems, but I don't have this volume of data.
The more items in a partition, the slower queries within that partition will be
Sorry no, I don't have the benchmarks
See below
Long Answer
In your case I suspect that SQL Azure is not going to work for you, simply because of the limits on the size of a SQL Azure database. If each of the rows you're inserting is 1K including indexes, you will hit the 50GB limit in about 300 days. It's true that Microsoft are talking about databases bigger than 50GB, but they've given no time frames for that. SQL Azure also has a throughput limit that I'm unable to find at this point (I'm pretty sure it's less than what you need though). You might be able to get around this by partitioning your data across more than one SQL Azure database.
The advantage SQL Azure does have though is the ability to run aggregate queries. In AZT you can't even write a select count(*) from customer without loading each customer.
AZT also has a limit of 500 transactions per second per partition, and a limit of "several thousand" per second per account.
I've found that choosing what to use for your partition key (PK) and row key (RK) depends on how you're going to query the data. If you want to access each of these items individually, simply give each row its own partition key and a constant row key. This will mean that you have lots of partitions.
For the sake of example, suppose the rows you were inserting were orders and the orders belong to a customer. If it was more common for you to list orders by customer, you would have PK = CustomerId, RK = OrderId. This would mean that to find the orders for a customer you simply have to query on the partition key. To get a specific order you'd need to know the CustomerId and the OrderId. The more orders a customer had, the slower finding any particular order would be.
If you only needed to access orders by OrderId, then you would use PK = OrderId, RK = string.Empty and put the CustomerId in another property. While you can still write a query that brings back all orders for a customer, because AZT doesn't support indexes other than on PartitionKey and RowKey, any query that doesn't use the PartitionKey (and sometimes even one that does, depending on how you write it) will cause a table scan. With the number of records you're talking about that would be very bad.
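To make the two key layouts concrete, here is a hedged sketch with azure-data-tables; the table names and properties are illustrative:

from azure.data.tables import TableClient

conn_str = "UseDevelopmentStorage=true"          # placeholder / local emulator

# Option A: PK = CustomerId, RK = OrderId.
by_customer = TableClient.from_connection_string(conn_str, "OrdersByCustomer")
by_customer.upsert_entity({"PartitionKey": "customer-123", "RowKey": "order-987"})
# Listing a customer's orders stays inside one partition:
mine = by_customer.query_entities("PartitionKey eq 'customer-123'")
# A specific order is a point read (needs both ids):
one = by_customer.get_entity(partition_key="customer-123", row_key="order-987")

# Option B: PK = OrderId, RK = "", CustomerId as a plain property.
by_id = TableClient.from_connection_string(conn_str, "OrdersById")
by_id.upsert_entity({"PartitionKey": "order-987", "RowKey": "",
                     "CustomerId": "customer-123"})
# This filter has no index behind it, so it degenerates into a table scan:
scan = by_id.query_entities("CustomerId eq 'customer-123'")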
In all of the scenarios I've encountered, having lots of partitions doesn't seem to worry AZT too much.
Another way you can partition your data in AZT that is not often mentioned is to put the data in different tables. For example, you might want to create one table for each day. If you want to run a query for last week, run the same query against the 7 different tables. If you're prepared to do a bit of work on the client end you can even run them in parallel.
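A small sketch of that per-day idea, again with azure-data-tables: fan the same filter out to the last 7 daily tables in parallel (the table naming convention is an assumption):

from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
from azure.data.tables import TableClient

conn_str = "UseDevelopmentStorage=true"    # placeholder / local emulator

def query_day(day: date, query_filter: str) -> list:
    table_name = "Events" + day.strftime("%Y%m%d")        # e.g. Events20240101
    table = TableClient.from_connection_string(conn_str, table_name)
    return list(table.query_entities(query_filter))

def last_week(query_filter: str) -> list:
    days = [date.today() - timedelta(days=i) for i in range(7)]
    with ThreadPoolExecutor(max_workers=7) as pool:
        results = pool.map(lambda d: query_day(d, query_filter), days)
    return [entity for day_result in results for entity in day_result]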
Azure SQL can easily ingest that much data and more. Here's a video I recorded months ago that shows a sample (available on GitHub) demonstrating one way you can do that.
https://www.youtube.com/watch?v=vVrqa0H_rQA
Here's the full repo:
https://github.com/Azure-Samples/streaming-at-scale/tree/master/eventhubs-streamanalytics-azuresql