I am using the Core API and the NodeJS API for Cosmos DB. I am trying to do a point read to save on RUs and latency. This documentation lead me to this as my "solution".
However, this makes no sense to me. I believe from similar systems that needs the item ID and partition key, but this makes no reference to the latter to top things off.
By modifying some update code, by mostly pure luck I ended up with what MIGHT be a point read but it gets the full item, not the "map" value I am looking for.
const { resource: updated } = await container
.item(
id = email,
partitionKeyValue = email
)
.read('SELECT c.map');
console.log(updated)
How do I read just the "map" value? The full document has much more in it and it would probably waste the benefit of a point read to get the whole thing.
There are two ways to read data: either via a query (where you can do a projection, grouping, filtering, etc) or by a point-read (a direct read, specifying id and partition key value), and that point-read bypasses the query engine.
Point-reads cost a bit less in RU, but could potentially consume a bit more bandwidth, as it returns an entire document (the actual underlying API call only allows for ID plus partition key value, and returns a single matching document, in its entirety).
Via query, you have flexibility to return as much or as little as you want, but the resulting operation will cost a bit more in RU.
Related
I have
an endpoint in an Azure Function called "INSERT" that that inserts a
record in Table Storage using a batch operation.
an endpoint in a different Azure Function
called "GET" that gets a record in Table Storage.
If I insert an item and then immediately get that same item, then the item has not appeared yet!
If I delay by one second after saving, then I find the item.
If I delay by 10ms after saving, then I don't find the item.
I see the same symptom when updating an item. I set a date field on the item. If I get immediately after deleting, then some times the date field is not set yet.
Is this known behavior in Azure Table Storage? I know about ETags as described here but I cannot see how it applies to this issue.
I cannot easily provide a code sample because this is distributed among multiple functions and I think if I did put it in a simpler example, then there would be some mechanism that would see I am calling from the same ip or with the same client and manage to return the recently saved item.
As mentioned in the comments, Azure Table Storage is Strongly Consistent. Data is available to you as soon as it is written to Storage.
This is in contrast with Cosmos DB Table Storage where there are many consistency levels and data may not be immediately available for you to read after it is written depending on the consistent level set.
The issue was related to my code and queues running in the background.
I had shut down the Function that has queue triggers but to my surprise I found that the Function in my staging slot was picking items off the queue. That is why it made a difference whether I delay for a second or two.
And to the second part, why a date field is seemingly not set as fast as I get it. Well, it turns out I had filtered by columns, like this:
var operation = TableOperation.Retrieve<Entity>(partitionKey, id, new List<string> { "Content", "IsDeleted" });
And to make matters worse, the class "Entity" that I deserialize to, of course had default primitive values (such as "false") so it didn't look like they were not being set.
So the answer does not have much to do with the question, so in summary, for anyone finding this question because they are wondering the same thing:
The answer is YES - Table Storage is in fact strongly consistent and it doesn't matter whether you're 'very fast' or connect from another location.
I'm new in Node.js and Cloud Functions for Firebase, I'll try to be specific for my question.
I have a firebase-database with objects including a "score" field. I want the data to be retrieved based on that, and that can be done easily in client side.
The issue is that, if the database gets to grow big, I'm worried that either it will take too long to return and/or will consume a lot of resources. That's why I was thinking of a http service using Cloud Functions to store a cache with the top N objects that will be updating itself when the score of any objects change with a listener.
Then, client side just has to call something like https://myexampleprojectroute/givemethetoplevels to receive a Json with the top N levels.
Is it reasonable? If so, how can I approach that? Which structures do I need to use this cache, and how to return them in json format via http?
At the moment I'll keep doing it client side but I'd really like to have that both for performance and learning purpose.
Thanks in advance.
EDIT:
In the end I did not implement the optimization. The reason why is, first, that the firebase database does not contain a "child count" so I didn't find a way with my newbie javascript knowledge to implement that. Second, and most important, is that I'm pretty sure it won't scale up to millions, having at most 10K entries, and firebase has rules for sorted reading optimization. For more information please check out this link.
Also, I'll post a simple code snippet to retrieve data from your database via http request using cloud-functions in case someone is looking for it. Hope this helps!
// Simple Test function to retrieve a json object from the DB
// Warning: No security methods are being used such authentication, request methods, etc
exports.request_all_levels = functions.https.onRequest((req, res) => {
const ref = admin.database().ref('CustomLevels');
ref.once('value').then(function(snapshot) {
res.status(200).send(JSON.stringify(snapshot.val()));
});
});
You're duplicating data upon writes, to gain better read performance. That's a completely reasonable approach. In fact, it is so common in NoSQL databases to keep such derived data structures that it even has a name: denormalization.
A few things to keep in mind:
While Cloud Functions run in a more predictable environment than the average client, the resources are still limited. So reading a huge list of items to determine the latest 10 items, is still a suboptimal approach. For simple operations, you'll want to keep the derived data structure up to date for every write operation.
So if you have a "latest 10" and a new item comes in, you remove the oldest item and add the new one. With this approach you have at most 11 items to consider, compared to having your Cloud Function query the list of items for the latest 10 upon every write, which is a O(something-with-n) operation.
Same for an averaging operation: you'll find a moving average to be most performant, because it doesn't require any of the previous data.
We are about to implement the Read portion of our CQRS system in-house with the goal being to vastly improve our read performance. Currently our reads are conducted through a web service which runs a Linq-to-SQL query against normalised data, involving some degree of deserialization from an SQL Azure database.
The simplified structure of our data is:
User
Conversation (Grouping of Messages to the same recipients)
Message
Recipients (Set of Users)
I want to move this into a denormalized state, so that when a user requests to see a feed of messages it reads from EITHER:
A denormalized representation held in Azure Table Storage
UserID as the PartitionKey
ConversationID as the RowKey
Any volatile data prone to change stored as entities
The messages serialized as JSON in an entity
The recipients of said messages serialized as JSON in an entity
The main problem with this the limited size of a row in Table Storage (960KB)
Also any queries on the "volatile data" columns will be slow as they aren't part of the key
A normalized representation held in Azure Table Storage
Different table for Conversation details, Messages and Recipients
Partition keys for message and recipients stored on the Conversation table.
Bar that; this follows the same structure as above
Gets around the maximum row size issue
But will the normalized state reduce the performance gains of a denormalized table?
OR
A denormalized representation held in SQL Azure
UserID & ConversationID held as a composite primary key
Any volatile data prone to change stored in separate columns
The messages serialized as JSON in a column
The recipients of said messages serialized as JSON in an column
Greatest flexibility for indexing and the structure of the denormalized data
Much slower performance than Table Storage queries
What I'm asking is whether anyone has any experience implementing a denormalized structure in Table Storage or SQL Azure, which would you choose? Or is there a better approach I've missed?
My gut says the normalized (At least to some extent) data in Table Storage would be the way to go; however I am worried it will reduce the performance gains to conduct 3 queries in order to grab all the data for a user.
Your primary driver for considering Azure Tables is to vastly improve read performance, and in your scenario using SQL Azure is "much slower" according to your last point under "A denormalized representation held in SQL Azure". I personally find this very surprising for a few reasons and would ask for detailed analysis on how this claim was made. My default position would be that under most instances, SQL Azure would be much faster.
Here are some reasons for my skepticism of the claim:
SQL Azure uses the native/efficient TDS protocol to return data; Azure Tables use JSON format, which is more verbose
Joins / Filters in SQL Azure will be very fast as long as you are using primary keys or have indexes in SQL Azure; Azure Tables do not have indexes and joins must be performed client side
Limitations in the number of records returned by Azure Tables (1,000 records at a time) means you need to implement multiple roundtrips to fetch many records
Although you can fake indexes in Azure Tables by creating additional tables that hold a custom-built index, you own the responsibility of maintaining that index, which will slow your operations and possibly create orphan scenarios if you are not careful.
Last but not least, using Azure Tables usually makes sense when you are trying to reduce your storage costs (it is cheaper than SQL Azure) and when you need more storage than what SQL Azure can offer (although you can now use Federations to break the single database maximum storage limitation). For example, if you need to store 1 billion customer records, using Azure Table may make sense. But using Azure Tables for increase speed alone is rather suspicious in my mind.
If I were in your shoes I would question that claim very hard and make sure you have expert SQL development skills on staff that can demonstrate you are reaching performance bottlenecks inherent of SQL Server/SQL Azure before changing your architecture entirely.
In addition, I would define what your performance objectives are. Are you looking at 100x faster access times? Did you consider caching instead? Are you using indexing properly in your database?
My 2 cents... :)
I won't try to argue on the exact definition of CQRS. As we are talking about Azure, I'll use it's docs as a reference. From there we can find that:
CQRS doesn't necessary requires that you use a separate read storage.
For greater isolation, you can physically separate the read data from the write data.
"you can" doesn't mean "you must".
About denormalization and read optimization:
Although
The read model of a CQRS-based system provides materialized views of the data, typically as highly denormalized views
the key point is
the read database can use its own data schema that is optimized for queries
It can be a different schema, but it can still be normalized or at least not "highly denormalized". Again - you can, but that doesn't mean you must.
More than that, if you performance is poor due to write locks and not because of heavy SQL requests:
The read store can be a read-only replica of the write store
And when we talk about request's optimization, it's better to talk more about requests themselves, and less about storage types.
About "it reads from either" [...]
The Materialized View pattern describes generating prepopulated views of data in environments where the source data isn't in a suitable format for querying, where generating a suitable query is difficult, or where query performance is poor due to the nature of the data or the data store.
Here the key point is that views are plural.
A materialized view can even be optimized for just a single query.
...
Materialized views tend to be specifically tailored to one, or a small number of queries
So you choice is not between those 3 options. It's much wider actually.
And again, you don't need another storage to create views. All can be done inside a single DB.
About
My gut says the normalized (At least to some extent) data in Table Storage would be the way to go; however I am worried it will reduce the performance gains to conduct 3 queries in order to grab all the data for a user.
Yes, of course, performance will suffer! (Also consider the matter of consistency). But will it be OK or not you can never be sure until you test it. With your data and your requests. Because delays in data transfers can actually be less than time required for some elaborate SQL-request.
So all boils down to:
What features do you need and which of them Table Storage and/or SQL Azure have?
And then, how much will it cost?
These you can only answer yourself. And these choices have little to do with performance. Because if there is a suitable index in either of those, I believe the performance will be virtually indistinguishable.
To sum up:
SQL Azure or Azure Table Storage?
For different requests and data you can and you probably should use both. But there is too little information in the question to give you the exact answer (we need an exact request for that). But I agree with #HerveRoggero - most probably you should stick with SQL Azure.
I am not sure if I can add any value to other answers, but I want to draw your attention toward modeling the data storage based on your query paths. Are you going to query all the mentioned data bits together? Is the user going to ask for some of it as additional information after a click or something? I am assuming that you have thought about this question already, and you are positive that you want to query everything in one go. i.e., the API or something needs to return all this information at once.
In that case, nothing will beat querying a single object by key. If you are talking about Azure's Table Storage specifically, it says right there that it's a key-value store. I am curious whether you have considered the document database (e.g. Cosmos DB) instead? If you are implementing CQRS read models, you could generate a single document per user that has all information that a user sees on a feed. You query that document by user id, which would be the key. This approach would be the optimal CQRS implementation in my mind because, after all, you are aiming to implement read models. Unless I misinterpreted something in your question or you have strong reasons to not go with document databases.
Everyone warns not to query against anything other than RowKey or PartitionKey in Azure Table Storage (ATS), lest you be forced to table scan. For a while, this has paralyzed me into trying to come up with exactly the right PK and RK and creating pseudo-secondary indexes in other tables when I needed to query something else.
However, it occurs to me that I would commonly table scan in SQL Server when I thought appropriate.
So the question becomes, how fast can I table scan an Azure Table. Is this a constant in entities/second or does it depend on record size, etc. Are there some rules of thumb as to how many records is too many to table scan if you want a responsive application?
The issue of a table scan has to do with crossing the partition boundaries. The level of performance you are guaranteed is explicity set at the partition level. therefore, when you run a full table scan, its a) not very efficient, b) doesn't have any guarantee of performance. This is because the partitions themselves are set on seperate storage nodes, and when you run a cross partition scan, you're consuming potentially massive amounts of resources (tieing up multiple nodes simultaneously).
I believe, that the effect of crossing these boundaries also results in continuation tokens, which require additional round-trips to storage to retrieve the results. This results then in reducing performance, as well as an increase in transaction counts (and subsequently cost).
If the number of partitions/nodes you're crossing is fairly small, you likely won't notice any issues.
But please don't quote me on this. I'm not an expert on Azure Storage. Its actually the area of Azure I'm the least knowledgeable about. :P
I think Brent is 100% on the money, but if you still feel you want to try it, I can only suggest to run some tests to find out yourself. Try include the partitionKey in your queries to prevent crossing partitions because at the end of the day that's the performance killer.
Azure tables are not optimized for table scans. Scanning the table might be acceptable for a long-running background job, but I wouldn't do it when a quick response is needed. With a table of any reasonable size you will have to handle continuation tokens as the query reaches a partition boundary or obtains 1k results.
The Azure storage team has a great post which explains the scalability targets. The throughput target for a single table partition is 500 entities/sec. The overall target for a storage account is 5,000 transactions/sec.
The answer is Pagination. Use the top_size -- max number of results or records in result -- in conjunction with next_partition_key and next_row_key the continuation tokens. That makes a significant even factorial difference in performance. For one, your results are statistically more likely to come from a single partition. Plain results show that sets are grouped by the partition continuation key and not the row continue key.
In other words, you also need to think about your UI or system output. Don't bother returning more than 10 to 20 results max 50. The user probably wont utilize or examine any more.
Sounds foolish. Do a Google search for "dog" and notice that the search returns only 10 items. No more. The next records are avail for you if you bother to hit 'continue'. Research has proven that almost no user ventures beyond that first page.
the select (returning a subset of the key-values) may make a difference; for example, use select = "PartitionKey,RowKey" or 'Name' whatever minimum you need.
"I believe, that the effect of crossing these boundaries also results
in continuation tokens, which require additional round-trips to
storage to retrieve the results. This results then in reducing
performance, as well as an increase in transaction counts (and
subsequently cost)."
...is slightly incorrect. the continuation token is used not because of crossing boundaries but because azure tables permit no more than 1000 results; therefore the two continuation tokens are used for the next set. default top_size is essentially 1000.
For your viewing pleasure, here's the description for queries entities from the azure python api. others are much the same.
'''
Get entities in a table; includes the $filter and $select options.
table_name: Table to query.
filter:
Optional. Filter as described at
http://msdn.microsoft.com/en-us/library/windowsazure/dd894031.aspx
select: Optional. Property names to select from the entities.
top: Optional. Maximum number of entities to return.
next_partition_key:
Optional. When top is used, the next partition key is stored in
result.x_ms_continuation['NextPartitionKey']
next_row_key:
Optional. When top is used, the next partition key is stored in
result.x_ms_continuation['NextRowKey']
'''
I ran across a mention somewhere that doing an emit(key, doc) will increase the amount of time an index takes to build (or something to that effect).
Is there any merit to it, and is there any reason not to just always do emit(key, null) and then include_docs = true?
Yes, it will increase the size of your index, because CouchDB effectively copies the entire document in those cases. For cases in which you can, use include_docs=true.
There is, however, a race condition to be aware of when using this that is mentioned in the wiki. It is possible, during the time between reading the view data and fetching the document, that said document has changed (or has been deleted, in which case _deleted will be true). This is documented here under "Querying Options".
This is a classic time/space tradeoff.
Emitting document data into your index will increase the size of the index file on disk because CouchDB includes the emitted data directly into the index file. However, this means that, when querying your data, CouchDB can just stream the content directly from the index file on disk. This is obviously quite fast.
Relying instead on include_docs=true will decrease the size of your on-disk index, it's true. However, on querying, CouchDB must perform a document read for every returned row. This involves essentially random document lookups from the main data file, meaning that the cost and time of returning data increases significantly.
While the query time difference for small numbers of documents is slow, it will add up over every call made by the application. For me, therefore, emitting needed fields from a document into the index is usually the right call -- disk is cheap, user's attention spans less so. This is broadly similar to using covering indexes in a relational database, another widely echoed piece of advice.
I did a totally unscientific test on this to get a feel for what the difference is. I found about an 8x increase in response time and 50% increase in CPU when using include_docs=true to read 100,000 documents from a view when compared to a view where the documents were emitted directly into the index itself.