QLDB - Indexed Storage vs Journal Storage - amazon-qldb

I am new to QLDB and have a few questions:
Why does indexed storage consume more space than journal storage?
During a large data insertion, the journal storage size keeps increasing even after the insertion has been stopped for a while. Can you explain why?
Which tables are present in indexed storage, besides the current tables and their history?

Q1: the Journal is conceptually similar to tape. It's really good (performance and cost) for sequential reads and writes, but poor for random access. This is why we use it to record transactions. Indexed storage is designed for random access, but this is simply more expensive to implement. We bill for them separately to keep the cost of QLDB as low as we can make it.
Q2: the Journal size will not (ever) increase unless you are committing transactions. You may be noticing this in CloudWatch Metrics, which may be delayed by a few minutes (or longer, depending on how you are aggregating the metrics).
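If you want to check when the number was actually reported, here is a minimal sketch with boto3 that pulls the JournalStorage metric QLDB publishes to CloudWatch (the ledger name "my-ledger" is hypothetical; substitute your own):

```python
# Hedged sketch: read the ledger's JournalStorage metric from CloudWatch so you
# can see the timestamps at which the value was reported (it lags real time).
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/QLDB",
    MetricName="JournalStorage",
    Dimensions=[{"Name": "LedgerName", "Value": "my-ledger"}],  # hypothetical ledger
    StartTime=now - timedelta(days=3),
    EndTime=now,
    Period=3600,
    Statistics=["Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"], "bytes")
```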
Q3: that's it, just your data. QLDB also offers some virtual tables (such as information_schema.user_tables), but these are not "in indexed storage" - i.e. you're not paying for them.
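To make Q3 concrete, here is a minimal sketch with the pyqldb Python driver (the ledger name "my-ledger" and table name "Vehicle" are hypothetical): it lists your tables through the virtual information_schema.user_tables view and reads a table's revision history, which together are all that indexed storage holds for you.

```python
# Minimal sketch using the pyqldb driver; "my-ledger" and "Vehicle" are
# hypothetical names -- substitute your own ledger and table.
from pyqldb.driver.qldb_driver import QldbDriver

driver = QldbDriver(ledger_name="my-ledger")

# information_schema.user_tables is a virtual view: you can query it,
# but it is not billed as indexed storage.
tables = driver.execute_lambda(lambda txn: list(
    txn.execute_statement("SELECT * FROM information_schema.user_tables")))
print(tables)

# history() exposes every committed revision of a table's documents.
revisions = driver.execute_lambda(lambda txn: list(
    txn.execute_statement("SELECT * FROM history(Vehicle)")))
print(len(revisions), "revisions")
```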

Related

How expensive is it to "index everything" on CosmosDB? (default behaviour)

By default, all the data in Azure Cosmos DB is indexed, i.e. every property inside a document/item gets consistent/automatic indexing.
However, the cost of storing this large amount of index data is not clearly visible to the end user.
How would you calculate/track the cost you are bearing for storing or using the index-related data?
As the billing is only related to RU/s and data storage, it is not clear how the indexing strategy affects billing.
Also, I wonder if the RU/s needed for intensive writes may be increased because of the indexes.
If so, unnecessary indexes in Cosmos DB should be excluded and only the necessary properties indexed, thus reducing the overall cost.
As the billing is only related to RU/s and data storage, it is not clear how the indexing strategy affects billing.
Indexing strategy will affect billing: if you index everything, you will consume more storage, and that in turn will increase your bill. Also, when an item is written to Cosmos DB, a part of your RU/s is spent on indexing that item, so you end up consuming more RU/s, which increases your bill further.
You may find these links helpful in optimizing the costs as far as indexing is concerned:
Optimize cost with indexing: https://learn.microsoft.com/en-us/azure/cosmos-db/optimize-cost-storage#optimize-cost-with-indexing.
Optimize by changing indexing policy: https://learn.microsoft.com/en-us/azure/cosmos-db/optimize-cost-throughput#optimize-by-changing-indexing-policy.
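As a concrete illustration of those links, here is a minimal sketch with the azure-cosmos Python SDK (the account URL, key, and database/container/property names are all hypothetical) that creates a container whose indexing policy only includes the paths you actually query and excludes everything else, reducing both index storage and the RU spent per write:

```python
# Hedged sketch: index only the properties you filter or sort on.
# The URL, key, and names below are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://myaccount.documents.azure.com", credential="<key>")
database = client.get_database_client("mydb")

narrow_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [
        {"path": "/category/?"},
        {"path": "/timestamp/?"},
    ],
    "excludedPaths": [
        {"path": "/*"},   # everything not listed above is left unindexed
    ],
}

container = database.create_container(
    id="items",
    partition_key=PartitionKey(path="/category"),
    indexing_policy=narrow_policy,
)
```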
Also, I wonder if the RU/s needed for intensive writes may be increased because of the indexes.
That is correct.
If so, unnecessary indexes in Cosmos DB should be excluded and only the necessary properties indexed, thus reducing the overall cost.
For bulk writes, it is recommended that you turn off indexing completely before doing the bulk writes and re-enable it once the write operations are completed. You can also request Lazy Indexing as described here.
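A hedged sketch of the "turn indexing off for a one-time bulk load, then turn it back on" approach, again with the azure-cosmos Python SDK (the names, keys, and the load_documents() helper are hypothetical):

```python
# Temporarily disable indexing around a one-time bulk load, then restore it.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://myaccount.documents.azure.com", credential="<key>")
database = client.get_database_client("mydb")
container = database.get_container_client("items")

# Turn indexing off for the duration of the load.
database.replace_container(
    container, partition_key=PartitionKey(path="/category"),
    indexing_policy={"indexingMode": "none", "automatic": False})

for doc in load_documents():   # load_documents() is a hypothetical generator
    container.upsert_item(doc)

# Restore automatic, consistent indexing once the load is complete.
database.replace_container(
    container, partition_key=PartitionKey(path="/category"),
    indexing_policy={"indexingMode": "consistent", "automatic": True})
```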
UPDATE - Adding comment from @MarkBrown (part of CosmosDB team):
I don't recommend using lazy indexing as there is no way to know when it is caught up and the data can be queried consistently. Also, a well-tuned indexing policy can be fine in bulk scenarios. Generally best reserved for massive one-time data loads. Not recommended for regular, batch type data ingestion.

Azure Table Storage vs Azure Document DB - performance comparison?

All other things being equal (regarding feature requirements, data requirements, etc.), which is faster for the following operations:
Inserts
Updates
Reads
Deletes
Please, I'm looking for a straight comparison of these raw functions given a scenario where either could be used equally effectively in terms of feature requirements.
You're comparing apples and oranges, and there is no single right answer for when you should choose one vs. the other. But objectively, there are some discrete differences:
Table storage supports up to 2,000 transactions / sec per partition (dictated by your chosen partition key), and 20,000 transactions / sec for an entire storage account. The number of transactions is not guaranteed, and varies based on entity size.
DocumentDB, while not providing "transactions" per second, provides a guaranteed level of "Request Units" per second. And by measuring your various queries, you can then scale your database to provide an equivalent number of transactions per second that your app requires. DocumentDB, by allowing you to adjust RU for a given collection, effectively lets you scale to a larger transaction rate than possible with Table Storage (you can certainly utilize multiple storage accounts to raise your effective table storage transaction rate). DocumentDB offers up to 10K RU/sec per collection (standard collection) or 250K RU/sec (partitioned collection), and the limits may be raised as needed, per support.
Table Storage supports Entity Group Transactions, allowing for operations of up to 100 entities (and up to 4MB payload) to be batched into a single atomic transaction. Transactions are bound to a single partition.
DocumentDB allows for transactions to occur within the bounds of a collection. If multiple database operations are performed within a stored procedure, those operations succeed or fail atomically.
Table Storage is a key/value store, and lookups on partition key + row key yield very efficient point lookups. Once you start filtering on properties other than PK/RK, you enter the territory of partition scans or table scans (see the sketch after this comparison).
DocumentDB is a document store, and you may index any/all properties within a document.
Table Storage scales to 500TB per account.
DocumentDB scales to 250GB per collection, more if you request additional storage (e.g. 500TB).
Table Storage provides security via storage access key. There's a master storage account key, as well as the ability to generate Shared Access Signatures to provide specific access rights to specific tables.
DocumentDB has both read/write and read-only admin keys, along with user-level access to collections/documents
Table Storage and DocumentDB have very different pricing models (where Table Storage is simply a per-GB-per-month cost, along with a nominal cost for transactions). But back to my point of apples vs oranges: DocumentDB is a database engine - query language, server-side procedures, triggers, indexes, etc.
I'm sure there are some objective comparisons that I missed, but that should give you a good starting point for making your decision to use one, the other, or both. And how you choose to apply each of these to your apps is really up to you, and what your priorities are (Scale? Queries? Cost? etc...).
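To make the point-lookup vs. scan distinction concrete, here is a minimal sketch using the current azure-data-tables Python SDK (which post-dates this answer; the connection string, table, and property names are hypothetical). The first call is an efficient point lookup on PK + RK; the second filters on a non-key property and therefore becomes a partition scan:

```python
# Hedged sketch: point lookup vs. partition scan in Azure Table Storage.
from azure.data.tables import TableClient

table = TableClient.from_connection_string("<connection-string>", table_name="Orders")

# Point lookup: PartitionKey + RowKey -> single entity, very cheap.
order = table.get_entity(partition_key="customer-42", row_key="order-000123")

# Filtering on a non-key property: the service must scan the partition.
late_orders = table.query_entities(
    "PartitionKey eq 'customer-42' and Status eq 'late'")
for entity in late_orders:
    print(entity["RowKey"], entity["Status"])
```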

We migrated our app from Parse to Azure but the costs of DocumentDB is so high. Are we doing something wrong?

We migrated our mobile app (still in development) from Parse to Azure. Everything is running, but the price of DocumentDB is so high that we can't continue with Azure without fixing that. We're probably doing something wrong.
1) The price seems to be bottlenecked by the DocumentDB requests.
While running a process to load the data (about 0.5 million documents), memory and CPU were fine, but the DocumentDB request limit was a bottleneck, and the price charged was very high.
2) Even after the end of this data migration (a few days of processing), Azure continues to charge us every day.
We can't understand what is going on here. The usage graphs are flat, but the price keeps climbing, as you can see in the images.
Any ideas?
Thanks!
From your screenshots, you have 15 collections under the Parse database. With Parse: Aside from the system classes, each of your user-defined classes gets stored in its own collection. And given that each (non-partitioned) collection has a starting run-rate of ~$24/month (for an S1 collection), you can see where the baseline cost would be for 15 collections (around $360).
You're paying for reserved storage and RU capacity. Regardless of RU utilization, you pay whatever the cost is for that capacity (e.g. S2 runs around $50/month / collection, even if you don't execute a single query). Similar to spinning up a VM of a certain CPU capacity and then running nothing on it.
The default throughput setting for the Parse collections is 1000 RU/s. This will cost $60 per collection per month (at the rate of $6 per 100 RU/s), so roughly $900/month across the 15 collections. Once you finish the Parse migration, the throughput can be lowered if you believe the workload has decreased, which will reduce the charge.
To learn how to do this, take a look at https://azure.microsoft.com/en-us/documentation/articles/documentdb-performance-levels/ (Changing the throughput of a Collection).
The key thing to note is that DocumentDB delivers predictable performance by reserving resources to satisfy your application's throughput needs. Because application load and access patterns change over time, DocumentDB allows you to easily increase or decrease the amount of reserved throughput available to your application.
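For reference, lowering the reserved throughput programmatically looks roughly like this with the current azure-cosmos Python SDK (which post-dates the original answer; the account URL, key, and names are hypothetical):

```python
# Hedged sketch: inspect and lower a container's provisioned RU/s.
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com", credential="<key>")
container = client.get_database_client("parse").get_container_client("GameScore")

current = container.get_throughput()
print("current RU/s:", current.offer_throughput)

# Lower the reservation once the migration workload is over.
container.replace_throughput(400)
```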
Azure is a "pay-for-what-you-use" model, especially around resources like DocumentDB and SQL Database where you pay for the level of performance required along with required storage space. So if your requirements are that all queries/transactions have sub-second response times, you may pay more to get that performance guarantee (ignoring optimizations, etc.)
One thing I would seriously look into is the DocumentDB Cost Estimation tool; this allows you to get estimates of throughput costs for different transaction types, based on sample JSON documents you provide.
So in this example, I have an 8KB JSON document, where I expect to store 500K of them (to get an approx. storage cost) and specifying I need throughput to create 100 documents/sec, read 10/sec, and update 100/sec (I used the same document as an example of what the update will look like).
NOTE this needs to be done PER DOCUMENT -- if you're storing documents that do not necessarily conform to a given "schema" or structure in the same collection, then you'll need to repeat this process for EVERY type of document.
Based on this information, I can use those values as inputs into the pricing calculator. This tells me that I can estimate about $450/mo for DocumentDB services alone (if this were my anticipated usage pattern).
There are additional ways you can optimize the Request Units (RUs -- metric used to measure the cost of the given request/transaction -- and what you're getting billed for): optimizing index strategies, optimizing queries, etc. Review the documentation on Request Units for more details.
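One practical way to see where your RUs are going is to read the request charge the service returns with every response. A minimal sketch with the azure-cosmos Python SDK (the account URL, key, and names are hypothetical, and the container is assumed to be partitioned on /player):

```python
# Hedged sketch: print the RU charge of individual operations.
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com", credential="<key>")
container = client.get_database_client("parse").get_container_client("GameScore")

container.upsert_item({"id": "score-1", "player": "alice", "value": 1200})
print("write:", container.client_connection.last_response_headers["x-ms-request-charge"], "RU")

container.read_item(item="score-1", partition_key="alice")
print("read:", container.client_connection.last_response_headers["x-ms-request-charge"], "RU")
```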

Dramatic decrease of Azure Table storage performance after querying whole partition

I use Azure Table storage as a time-series database. The database is constantly extended with more rows (approximately 20 rows per second for each partition). Every day I create new partitions for the day's data so that all partitions have a similar size and never get too big.
Until now everything worked flawlessly, when I wanted to retrieve data from a specific partition it would never take more than 2.5 secs for 1000 values and on average it would take 1 sec.
When I tried to query all the data of a partition, though, things got really slow: towards the middle of the procedure each query would take 30-40 secs for 1000 values.
So I cancelled the procedure just to restart it for a smaller range. But now all queries take too long. From the beginning, all queries need 15-30 secs. Could that mean the data got rearranged in an inefficient way, and that's why I am seeing this dramatic decrease in performance? If so, is there a way to handle such a rearrangement?
I would definitely recommend you to go over the links Jason pointed to above. You have not given much detail about how you generate your partition keys, but from the sounds of it you are falling into several anti-patterns, including the Append (or Prepend) anti-pattern and putting too many entities in a single partition. I would recommend reducing your partition size and also putting either a hash or a random prefix on your partition keys so they are not in lexicographical order (see the sketch below).
Azure storage follows a range partitioning scheme in the background, so even if the partition keys you picked are unique, if they are sequential they will fall into the same range and potentially be served by a single partition server, which would hamper the ability of the Azure storage service to load balance and scale out your storage requests.
The other aspect you should think about is how you are reading the entities back. The best option is a point query with partition key and row key, the worst is a full table scan with no PK and RK, and in between you have the partition scan, which in your case will also perform poorly due to your partition size.
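Here is a minimal sketch of the hash-prefix idea (the bucket count and key layout are illustrative, not a prescription): the prefix breaks the lexicographical ordering of day-based keys, so consecutive days no longer land on the same range partition.

```python
# Hedged sketch: prefix time-series partition keys with a hash bucket so that
# sequential days are spread across several key ranges.
import hashlib

BUCKETS = 16  # illustrative; pick a count that matches your write volume

def partition_key(source_id: str, day: str) -> str:
    bucket = int(hashlib.sha256(source_id.encode()).hexdigest(), 16) % BUCKETS
    return f"{bucket:02d}-{day}"   # e.g. "07-2016-05-01"

print(partition_key("sensor-42", "2016-05-01"))
```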
One of the challenges with time series data is that you can end up writing all your data to a single partition which prevents Table Storage from allocating additional resources to help you scale. Similarly for read operations you are constrained by potentially having all your data in a single partition which means you are limited to 2000 entities / second - whereas if you spread your data across multiple partitions you can parallelize the query and yield far greater scale.
Do you have Storage Analytics enabled? I would be interested to know if you are getting throttled at all or what other potential issues might be going on. Take a look at the Storage Monitoring, Diagnosing and Troubleshooting guide for more information.
If you still can't find the information you want please email AzTableFeedback@microsoft.com and we would be happy to follow up with you.
The Azure Storage Table Design Guide talks about general scalability guidance as well as patterns / anti-patterns (see the append only anti-pattern for a good overview) which is worth looking at.

Is a cloud service suitable for this application?

I'm looking for details of the cloud services popping up (e.g. Amazon/Azure) and am wondering if they would be suitable for my app.
My application basically has a single table database which is about 500GB. It grows by 3-5 GB/Day.
I need to extract text data from it, about 1 million rows at a time, filtering on about 5 columns. This extracted data is usually about 1-5 GB, compresses to 100-500 MB, and is then made available on the web.
There are some details of my existing implementation here
One 400GB table, One query - Need Tuning Ideas (SQL2005)
So, my question:
Would the existing cloud services be suitable to host this type of app? What would the cost be to store this amount of data and bandwidth (bandwidth usage would be about 2GB/day)?
Are the persistence systems suitable for storing large flat tables like this, and do they offer the ability to search on a number of columns?
My current implementation runs on sub $10k hardware so it wouldn't make sense to move if costs are much higher than, say, $5k/yr.
Given the large volume of data and the rate at which it's growing, I don't think Amazon would be a good option. I'm assuming that you'll want to keep the data on persistent storage, but with EC2 you need to allocate a given amount of storage and attach it as a disk. Unless you want to allocate a really large amount of space up front (and then pay for unused disk space), you will have to constantly be adding more disks. I did a quick back-of-the-envelope calculation and I estimate it will cost between $2,500 - $10,000 per year for hosting. It's difficult for me to estimate accurately because of all of the variable things that Amazon charges for (instance uptime, storage space, bandwidth, disk I/O, etc.). Here's the EC2 pricing.
Assuming that this is non-relational data (can't do relational data on a single table) you could consider using Azure Table Storage which is a storage mechanism designed for non-relational structured data.
The problem that you will have here is that Azure Tables only have a primary index and therefore cannot be indexed by 5 columns as you require, unless you store the data 5 times, indexed each time by the column you wish to filter on. I'm not sure that would work out very cost-effective, though.
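If you did go down that road, the "store it N times" idea amounts to writing the same row into several tables, each partitioned by a different filter column. A hedged sketch with the current azure-data-tables Python SDK (the connection string, table names, and columns are hypothetical):

```python
# Hedged sketch of the "one copy per filter column" pattern described above.
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")

INDEX_COLUMNS = ["Country", "Product", "Status"]   # the columns you filter on

def write_indexed(row: dict, row_id: str) -> None:
    # Write one copy of the row per filter column, partitioned by that column.
    for col in INDEX_COLUMNS:
        service.create_table_if_not_exists(f"ByIndex{col}")
        table = service.get_table_client(f"ByIndex{col}")
        entity = dict(row, PartitionKey=str(row[col]), RowKey=row_id)
        table.upsert_entity(entity)

write_indexed({"Country": "DE", "Product": "Widget", "Status": "Shipped"}, "000123")
```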
Costs for Azure Table storage start from as little as 8 cents USD per GB per month, depending on how much data you store. There are also charges per transaction and charges for egress data.
For more info on pricing, check here: http://www.windowsazure.com/en-us/pricing/calculator/advanced/
Where do you need to access this data from?
How is it written to?
Based on this there could be other options to consider too, like Azure Drives etc.
