Storage Optimization - google-cloud-spanner

I am working on a project that uses Cloud Spanner, and my team wants to optimize the storage in our database.
We are interested in determining how much storage we are using at a finer granularity, for example per row or per column.
E.g.
We have a column of type STRING(36) and a column of type STRING(MAX)
If both columns contain a 36-character string, will the storage used be equal?
I have read the Cloud Spanner documentation, and we are running tests that create new databases and compare their total size. We are hoping to learn more about how we can optimize our Cloud Spanner storage usage.

Yes, the storage used will be equal. For both STRING(MAX) and BYTES(MAX), Spanner only uses the amount of storage required for the values you actually write to such columns. The declared length has no effect on storage; its only effect is write-time enforcement of the size limit.
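To make the comparison concrete, here is a minimal sketch using the google-cloud-spanner Python client (the instance, database, and table names are hypothetical) that declares both column types side by side; a 36-character value written to either column occupies the same storage.

```python
from google.cloud import spanner

# Hypothetical instance/database names for illustration only.
client = spanner.Client()
instance = client.instance("my-instance")
database = instance.database("my-database")

# STRING(36) and STRING(MAX) differ only in the write-time length check;
# the stored bytes for a given value are the same in both columns.
operation = database.update_ddl([
    """CREATE TABLE Sessions (
         SessionId STRING(36) NOT NULL,
         Payload   STRING(MAX)
       ) PRIMARY KEY (SessionId)"""
])
operation.result(timeout=300)  # update_ddl returns a long-running operation
```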

Related

Cassandra maximum realistic blob size

I'm trying to evaluate a few distributed storage platforms and Cassandra is one of them.
Our requirement is to save files between 1MB and 50MB of size and according to Cassandra's documentation http://docs.datastax.com/en/cql/3.3/cql/cql_reference/blob_r.html:
The maximum theoretical size for a blob is 2 GB. The practical limit on blob size, however, is less than 1 MB.
Does anyone have experience storing files in Cassandra as blobs? Any luck with it? Is the performance really bad with bigger file sizes?
Any other suggestion would also be appreciated!
Cassandra was not built for this type of job.
In Cassandra, a single column value can be up to 2 GB (1 MB is recommended). So if you want to use Cassandra as object storage, split each large object into multiple small chunks and store them with the object id as the partition key and the chunk (bucket) id as the clustering key, as sketched below.
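A minimal sketch of that chunking approach with the Python cassandra-driver (the contact point, keyspace, table name, and chunk size are assumptions, not anything from Cassandra's docs):

```python
import uuid
from cassandra.cluster import Cluster

CHUNK_SIZE = 1024 * 1024  # ~1 MB per column value, per the practical limit above

cluster = Cluster(["127.0.0.1"])          # assumed contact point
session = cluster.connect("my_keyspace")  # assumed keyspace

# Object id is the partition key, chunk id is the clustering key,
# so all chunks of a file live together and come back in order.
session.execute("""
    CREATE TABLE IF NOT EXISTS file_chunks (
        file_id  uuid,
        chunk_id int,
        data     blob,
        PRIMARY KEY (file_id, chunk_id)
    )
""")

insert = session.prepare(
    "INSERT INTO file_chunks (file_id, chunk_id, data) VALUES (?, ?, ?)"
)

def store_file(path):
    """Split a local file into ~1 MB chunks and write each as one row."""
    file_id = uuid.uuid4()
    with open(path, "rb") as f:
        chunk_id = 0
        while chunk := f.read(CHUNK_SIZE):
            session.execute(insert, (file_id, chunk_id, chunk))
            chunk_id += 1
    return file_id
```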
It is best to use a distributed object storage system like OpenStack Object Storage ("Swift"):
The OpenStack Object Store project, known as Swift, offers cloud storage software so that you can store and retrieve lots of data with a simple API. It's built for scale and optimized for durability, availability, and concurrency across the entire data set. Swift is ideal for storing unstructured data that can grow without bound.

Azure Table Storage vs Azure Document DB - performance comparison?

All other things being equal (regarding feature requirements, data requirements, etc), which is faster in the following functions:
Inserts
Updates
Reads
Deletes
Please, I'm looking for a straight comparison of these raw functions given a scenario where either could be used equally effectively in terms of feature requirements.
You're comparing apples and oranges, and there is no single right answer as to which scenarios should lead you to choose one over the other. But objectively, there are some discrete differences:
Table storage supports up to 2,000 transactions / sec per partition (dictated by your chosen partition key), and 20,000 transactions / sec for an entire storage account. The number of transactions is not guaranteed, and varies based on entity size.
DocumentDB, while not providing "transactions" per second, provides a guaranteed level of "Request Units" per second. And by measuring your various queries, you can then scale your database to provide an equivalent number of transactions per second that your app requires. DocumentDB, by allowing you to adjust RU for a given collection, effectively lets you scale to a larger transaction rate than possible with Table Storage (you can certainly utilize multiple storage accounts to raise your effective table storage transaction rate). DocumentDB offers up to 10K RU/sec per collection (standard collection) or 250K RU/sec (partitioned collection), and the limits may be raised as needed, per support.
Table Storage supports Entity Group Transactions, allowing for operations of up to 100 entities (and up to 4MB payload) to be batched into a single atomic transaction. Transactions are bound to a single partition.
DocumentDB allows for transactions to occur within the bounds of a collection. If multiple database operations are performed within a stored procedure, those operations succeed or fail atomically.
Table Storage is a key/value store, and lookups on partition key + row key yield very efficient point lookups (see the sketch at the end of this answer). Once you start examining properties other than PK/RK, you enter the territory of partition scans or table scans.
DocumentDB is a document store, and you may index any/all properties within a document.
Table Storage scales to 500TB per account.
DocumentDB scales to 250GB per collection, more if you request additional storage (e.g. 500TB).
Table Storage provides security via storage access key. There's a master storage account key, as well as the ability to generate Shared Access Signatures to provide specific access rights to specific tables.
DocumentDB has both read/write and read-only admin keys, along with user-level access to collections/documents.
Table Storage and DocumentDB have very different pricing models (where Table Storage is simply a per-GB-per-month cost, along with a nominal cost for transactions). But back to my point of apples vs oranges: DocumentDB is a database engine - query language, server-side procedures, triggers, indexes, etc.
I'm sure there are some objective comparisons that I missed, but that should give you a good starting point for making your decision to use one, the other, or both. And how you choose to apply each of these to your apps is really up to you, and what your priorities are (Scale? Queries? Cost? etc...).
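To make the point-lookup and entity-group-transaction items above concrete, here is a small sketch with the current azure-data-tables Python SDK (the connection string, table name, and key values are placeholders):

```python
from azure.data.tables import TableClient

conn_str = "<storage-account-connection-string>"  # placeholder
table = TableClient.from_connection_string(conn_str, table_name="Readings")

# Point lookup: PartitionKey + RowKey resolves to a single entity with no scan.
entity = table.get_entity(partition_key="device-42", row_key="000123")

# Entity group transaction: up to 100 operations against a single partition,
# applied atomically (the 4 MB payload limit still applies).
operations = [
    ("upsert", {"PartitionKey": "device-42", "RowKey": f"{i:06d}", "value": i})
    for i in range(100)
]
table.submit_transaction(operations)
```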

Dramatic decrease of Azure Table storage performance after querying whole partition

I use Azure Table storage as a time series database. The database is constantly extended with more rows (approximately 20 rows per second for each partition). Every day I create new partitions for the day's data so that all partitions have a similar size and never get too big.
Until now everything worked flawlessly, when I wanted to retrieve data from a specific partition it would never take more than 2.5 secs for 1000 values and on average it would take 1 sec.
When I tried to query all the data of a partition though things got really really slow, towards the middle of the procedure each query would take 30-40 sec for 1000 values.
So I cancelled the procedure just to restart it for a smaller range. But now all queries take too long; from the beginning, every query needs 15-30 secs. Can that mean that the data got rearranged in an inefficient way, and that's why I am seeing this dramatic decrease in performance? If yes, is there a way to handle such a rearrangement?
I would definitely recommend you to go over the links Jason pointed to above. You have not given much detail about how you generate your partition keys, but from the sound of it you are falling into several anti-patterns, including append (or prepend) keys and too many entities in a single partition. I would recommend that you reduce your partition size and also put either a hash or a random prefix on your partition keys so they are not in lexicographical order (sketched below).
Azure storage follows a range partitioning scheme in the background, so even if the partition keys you pick are unique, if they are sequential they will fall into the same range and potentially be served by a single partition server, which would hamper the ability of the Azure storage service to load balance and scale out your storage requests.
The other aspect you should think about is how you read the entities back: the best option is a point query with partition key and row key, the worst is a full table scan with no PK and RK, and in between you have the partition scan, which in your case will also perform pretty badly due to your partition size.
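A minimal sketch of the hash-prefix idea (the bucket count, series id, and key layout are assumptions you would tune for your own data):

```python
import hashlib
from datetime import datetime, timezone

NUM_BUCKETS = 16  # assumption: spread each day's writes over 16 key ranges

def make_partition_key(series_id: str, ts: datetime) -> str:
    # A short hash-derived prefix breaks the lexicographical ordering of
    # date-based keys, so consecutive writes land in different key ranges
    # instead of all hitting the same partition server.
    bucket = int(hashlib.sha1(series_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket:02d}_{ts.strftime('%Y%m%d')}_{series_id}"

print(make_partition_key("sensor-7", datetime.now(timezone.utc)))
```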
One of the challenges with time series data is that you can end up writing all your data to a single partition which prevents Table Storage from allocating additional resources to help you scale. Similarly for read operations you are constrained by potentially having all your data in a single partition which means you are limited to 2000 entities / second - whereas if you spread your data across multiple partitions you can parallelize the query and yield far greater scale.
Do you have Storage Analytics enabled? I would be interested to know if you are getting throttled at all or what other potential issues might be going on. Take a look at the Storage Monitoring, Diagnosing and Troubleshooting guide for more information.
If you still can't find the information you want, please email AzTableFeedback@microsoft.com and we would be happy to follow up with you.
The Azure Storage Table Design Guide talks about general scalability guidance as well as patterns / anti-patterns (see the append only anti-pattern for a good overview) which is worth looking at.

Are there any limits on the number of Azure Storage Tables allowed in one account?

I'm currently trying to store a fairly large and dynamic data set.
My current design is tending towards a solution where I will create a new table every few minutes - this means every table will be quite compact, it will be easy for me to search my data (I don't need everything in one table), and it should be easy for me to delete stale data.
I've looked and I can't see any documented limits - but I wanted to check:
Is there any limit on the number of tables allowed within one Azure storage account?
Or can I keep adding potentially thousands of tables without any concern?
There are no published limits on the number of tables, only the 500TB limit on a given storage account. Combined with partition+row keys, it sounds like you'll have a direct link to your data without running into any table-scan issues.
This MSDN article explicitly calls out: "You can create any number of tables within a given storage account, as long as each table is uniquely named." Have fun!
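If you do go with a table every few minutes, here is a minimal sketch with the azure-data-tables Python SDK (the connection string and naming scheme are placeholders; table names must be alphanumeric):

```python
from datetime import datetime, timezone
from azure.data.tables import TableServiceClient

conn_str = "<storage-account-connection-string>"  # placeholder
service = TableServiceClient.from_connection_string(conn_str)

def table_for(ts: datetime) -> str:
    # Table names must be alphanumeric and start with a letter,
    # so encode the 5-minute bucket as digits only.
    bucket = ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)
    return "data" + bucket.strftime("%Y%m%d%H%M")

name = table_for(datetime.now(timezone.utc))
service.create_table_if_not_exists(name)  # idempotent, safe on every write path

# Dropping stale data later is a single call per bucket:
# service.delete_table(name)
```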

Is a cloud service suitable for this application?

I'm looking for details of the cloud services popping up (e.g. Amazon/Azure) and am wondering if they would be suitable for my app.
My application basically has a single table database which is about 500GB. It grows by 3-5 GB/Day.
I need to extract text data from it, about 1 million rows at a time, filtering on about 5 columns. This extracted data is usually about 1-5 GB and zips up to 100-500MB and then made available on the web.
There are some details of my existing implementation here
One 400GB table, One query - Need Tuning Ideas (SQL2005)
So, my question:
Would the existing cloud services be suitable to host this type of app? What would the cost be to store this amount of data and bandwidth (bandwidth usage would be about 2GB/day)?
Are the persistence systems suitable for storing large flat tables like this, and do they offer the ability to search on a number of columns?
My current implementation runs on sub $10k hardware so it wouldn't make sense to move if costs are much higher than, say, $5k/yr.
Given the large volume of data and the rate at which it's growing, I don't think that Amazon would be a good option. I'm assuming that you'll want to store the data on persistent storage, but with EC2 you need to allocate a given amount of storage and attach it as a disk. Unless you want to allocate a really large amount of space (and then pay for unused disk space), you will have to constantly add more disks. I did a quick back-of-the-envelope calculation and I estimate it will cost between $2,500 and $10,000 per year for hosting. It's difficult for me to estimate accurately because of all of the variable things that Amazon charges for (instance uptime, storage space, bandwidth, disk IO, etc.). Here's the EC2 pricing.
Assuming that this is non-relational data (you can't do relational data with a single table), you could consider using Azure Table Storage, which is a storage mechanism designed for non-relational structured data.
The problem that you will have here is that Azure Tables only have a primary index and therefore cannot be indexed by the 5 columns you require, unless you store the data 5 times, indexed each time by the column you wish to filter on (sketched below). Not sure that would work out to be very cost-effective though.
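A rough sketch of that "store it N times" index-table pattern with the azure-data-tables Python SDK (the column names, table naming, and record layout are all hypothetical):

```python
from azure.data.tables import TableServiceClient

conn_str = "<storage-account-connection-string>"  # placeholder
service = TableServiceClient.from_connection_string(conn_str)

# One table per filterable column; the filtered value becomes the PartitionKey,
# so each filter is served by a partition-scoped query instead of a table scan.
INDEX_COLUMNS = ["country", "category", "status", "source", "region"]  # hypothetical

for col in INDEX_COLUMNS:
    service.create_table_if_not_exists(f"by{col}")

def write_indexed(record: dict, record_id: str) -> None:
    """Write the same record into each index table, keyed by that column's value."""
    for col in INDEX_COLUMNS:
        table = service.get_table_client(table_name=f"by{col}")
        entity = dict(record, PartitionKey=str(record[col]), RowKey=record_id)
        table.upsert_entity(entity)
```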
Costs for Azure Table storage start from as little as 8c USD per GB per month, depending on how much data you store. There are also charges per transaction and charges for egress data.
For more info on pricing check here: http://www.windowsazure.com/en-us/pricing/calculator/advanced/
Where do you need to access this data from?
How is it written to?
Based on this there could be other options to consider too, like Azure Drives etc.
