I am developing a site with a large number of users and a lot of content, so I need to cache some data about them (as a string key plus a serialized array or columns; millions of rows). I am looking at Windows Azure Table Storage, Cassandra, and HBase.
The plus of Azure Table Storage: scalability, and no need to think about disk space, the number of folders and files per folder, and so on. The minus: I don't yet know how fast access to it over REST will be (request and response time), since it is not local file access.
The plus of Cassandra or HBase: they are proven systems. The minuses: they have to be deployed and tuned; I don't yet know what performance they provide for a given data size and number of requests; they will take a lot of disk space on the VM that hosts the site, so I would have to monitor and scale it, and on Azure scaling VM storage is not a one-button operation.
Please recommend which system to choose and why.
Ok, so, I have Auto Loader working in directory listing mode because the event-driven (file notification) mode requires far more elevated permissions than we can get in LIVE.
So, basically, what the Auto Loader job does is: it iteratively reads Parquet files from many different folders in the landing zone (many small files), writes them into a raw container as Delta Lake with schema inference and evolution, creates external tables, and runs an OPTIMIZE.
That's about it.
My question is: for this workload, what would be the ideal node type (worker and driver) for my cluster in Azure? Should it be "Compute Optimized", "Storage Optimized", or "Memory Optimized"?
From this link, I can see that "Compute Optimized" would probably be the best choice, but since my job spends most of its time reading landing files (many small files) and writing Delta files, checkpoints, and schemas, shouldn't "Storage Optimized" be the better fit here?
I plan to try all of them out, but if someone already has pointers, it would be appreciated.
By the way, the storage here is Azure Data Lake Storage Gen2.
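For reference, a minimal sketch of the Auto Loader setup described above, with the options expressed as a plain dict so the intent is visible. All paths and table names here are placeholders, not the real configuration:

```python
# Hedged sketch of the cloudFiles (Auto Loader) options implied by the setup
# above; on Databricks these would be passed to
# spark.readStream.format("cloudFiles").options(...).
autoloader_options = {
    "cloudFiles.format": "parquet",            # landing zone holds Parquet files
    "cloudFiles.useNotifications": "false",    # directory listing mode, no event grid
    "cloudFiles.schemaLocation": "/mnt/raw/_schemas/my_table",      # placeholder path
    "cloudFiles.schemaEvolutionMode": "addNewColumns",              # schema evolution
}

# The stream itself (not runnable outside Databricks, shown for shape only):
# df = spark.readStream.format("cloudFiles").options(**autoloader_options).load("/mnt/landing/")
# (df.writeStream.format("delta")
#    .option("checkpointLocation", "/mnt/raw/_checkpoints/my_table")
#    .trigger(availableNow=True)
#    .start("/mnt/raw/my_table"))
# followed by CREATE TABLE ... USING DELTA LOCATION '...' and OPTIMIZE.
```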
If you don't do too many complex aggregations, then I would recommend "Compute Optimized" or "General Purpose" nodes for that work. The primary load will in any case be reading the data from files, combining it, and writing to ADLS, so the more CPU power you have, the faster the data processing will be.
Only if you have very many small files (think tens or hundreds of thousands) should you consider a bigger node for the driver, whose role is to identify the new files in storage.
In Azure Cosmos DB, is it possible to create multiple read replicas at a database / container / partition key level to increase read throughput? I have several containers that will need more than 10K RU/s per logical partition key, and re-designing my partition key logic is not an option right now. Thus, I'm thinking of replicating data (eventual consistency is fine) several times.
I know Azure offers global distribution with Cosmos DB, but what I'm looking for is replication within the same region and ideally not a full database replication but a container replication. A container-level replication will be more cost effective since I don't need to replicate most containers and I need to replicate the others up to 10 times.
A few options are available, though:
Within the same region there is no replication option, but you could use the Change Feed to replicate into another database, designed with your read queries in mind, purely for serving reads.
It might also be worth looking at the serverless option, which is in preview, or at the autoscale option.
You can also look at provisioned throughput and reserve the provisioned RUs for one or three years, paying monthly just as you would in the pay-as-you-go model but with a large discount.
Another option is to run a latency test from a VM in the region where your main DB and app are running, find the closest region by latency (ms), and if that latency is bearable, enable global replication to that region and start using it. I use this tool for latency tests, but run it from a VM within the region where your app/DB is running.
My guess is your queries are all cross-partition and each query is consuming a ton of RU/s. Unfortunately there is no feature in Cosmos DB to help in the way you're asking.
Your options are to create more containers and use change feed to replicate data to them in-region and then add some sort of routing mechanism to route requests in your app. You will of course only get eventual consistency with this so if you have high concurrency needs this won't work. Your only real option then is to address the scalability issues with your design today.
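A hedged sketch of such a routing mechanism, assuming hypothetical replica container names kept in sync via change feed: pick a replica deterministically by hashing the partition key, so reads spread evenly across the copies.

```python
import zlib

# Hypothetical replica containers, all holding the same data (synced via change feed)
REPLICA_CONTAINERS = ["orders-replica-0", "orders-replica-1", "orders-replica-2"]

def pick_replica(partition_key: str) -> str:
    """Deterministically map a partition key to one replica container.

    Any replica can serve the read (eventual consistency is accepted here),
    so the hash is only used to spread load evenly across the copies."""
    idx = zlib.crc32(partition_key.encode("utf-8")) % len(REPLICA_CONTAINERS)
    return REPLICA_CONTAINERS[idx]

# the same key always routes to the same container
assert pick_replica("customer-42") == pick_replica("customer-42")
```

Round-robin instead of hashing would spread load even more evenly, at the cost of losing per-key locality in the app tier.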
There is something that can help: this live data migrator. You can use it to keep a second container in sync with your original, which will let you eventually migrate off your first design to one that scales better.
We are trying to restore one database backup, stored in Azure, to multiple SQL instances at the same time and are running into issues, getting error descriptions like "Desc=Open devices!" and "Desc=Create Memory. ErrorCode=(5)Access is denied.".
Is this possible? Or do they need to be restored successively?
SQL Azure has logic to do various operations online/automatically for you (copying databases, restoring databases, backups, upgrades, etc.). Some operations are limited to 25 in parallel. Each operation requires IO, so limits are in place because the machine does not have infinite IOPS. (Those limits may change over time as Microsoft improves the service, gets newer hardware, etc.)
You can restore N databases in parallel from a database backup, but you are still subject to the IOPS limit. You can try a larger reservation size for the source and target during the restore operations to get more IOPS and lower the time the operations take.
Try creating a bacpac of the database you want to restore and mixing restores from backups with restores from bacpacs in parallel, to work around the limits without adding IOPS and increasing costs.
We have an application that is quite scalable as it is. Basically, you have one or more stateless nodes that all do independent work on files read from and written to a shared NFS share.
This NFS share can be a bottleneck, but with on-premises deployments customers just buy a big enough box to get sufficient performance.
Now we are moving this to Azure, and I would like a better, more "cloudy" way of sharing data :) Running a Linux NFS server isn't the ideal scenario if we have to manage it ourselves.
Is the Azure Blob storage the right tool for this job (https://azure.microsoft.com/en-us/services/storage/blobs/)?
we need good scalability, e.g. up to 10k files written in a minute
files are quite small, less than 50KB per file on average
files created and read, not changed
files are short lived, we purge them every day
I am looking for more practical experience with this kind of storage and how good it really is.
There are two possible solutions to your request: Azure Storage Blobs (recommended for your scenario) or Azure Files.
Azure Blobs has documented scaling targets (see the storage scalability targets documentation).
It can't be attached to a server like a network share.
Blobs do not support a hierarchical file structure beyond containers. (Virtual folders can be used, but the con is that you can't delete a container while it contains blobs, which matters for the purging; there are ways to do the purging in your own code, though.)
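As a sketch of what "purging in your own code" can look like (blob names and paths are made up for illustration): prefix each blob name with its date, then have a daily job compute yesterday's prefix, list blobs under it, and delete them.

```python
from datetime import date, timedelta

def blob_name(d: date, filename: str) -> str:
    """Virtual-folder style blob name, e.g. '2021-06-01/data.bin'."""
    return f"{d.isoformat()}/{filename}"

def purge_prefix(today: date) -> str:
    """Prefix covering the blobs the daily purge job should delete."""
    return (today - timedelta(days=1)).isoformat() + "/"

# A daily job would list blobs by this prefix (for instance with the Azure SDK's
# ContainerClient.list_blobs(name_starts_with=...)) and delete each one.
assert purge_prefix(date(2021, 6, 2)) == "2021-06-01/"
```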
Azure Files:
Links recommended:
Comparison between Azure Files and Blobs: https://learn.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks
Informative SO post here
So the scenario is the following:
I have multiple instances of a web service that write a blob of data to Azure Storage. I need to be able to group blobs into a container (or a virtual directory) depending on when they were received. Once in a while (at worst, every day) older blobs will be processed and then deleted.
I have two options:
Option 1
I make one container called "blobs" (for example) and store all the blobs in that container. Each blob uses a directory-style name, with the directory part being the time it was received (e.g. "hr0min0/data.bin", "hr0min0/data2.bin", "hr0min30/data3.bin", "hr1min45/data.bin", ..., "hr23min0/dataN.bin", etc; a new directory every X minutes). The process that handles these blobs will process the hr0min0 blobs first, then hr0minX, and so on (while new blobs are still being written).
Option 2
I have many containers, each named after the arrival time (so the first will be a container called blobs_hr0min0, then blobs_hr0minX, etc.), and all the blobs in a container are the ones that arrived at the named time. The process that handles these blobs will process one container at a time.
So my question is, which option is better? Does option 2 give me better parallelization (since containers can be on different servers), or is option 1 better because many containers can cause other, unknown issues?
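For what it's worth, Option 1's time-bucketed naming (one virtual directory every X minutes) can be sketched as a small helper; the 15-minute bucket size is just an assumption matching the example names:

```python
from datetime import datetime

def bucket_name(ts: datetime, minutes: int = 15) -> str:
    """Round a timestamp down to an X-minute bucket, e.g. 'hr1min45'."""
    return f"hr{ts.hour}min{(ts.minute // minutes) * minutes}"

def blob_path(ts: datetime, filename: str, minutes: int = 15) -> str:
    """Directory-style blob name inside a single 'blobs' container."""
    return f"{bucket_name(ts, minutes)}/{filename}"

# the processor then works through the buckets in order: hr0min0, hr0min15, ...
assert blob_path(datetime(2012, 5, 1, 1, 45), "data.bin") == "hr1min45/data.bin"
```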
Everyone has given you excellent answers about accessing blobs directly. However, if you need to list blobs in a container, you will likely see better performance with the many-container model. I just talked with a company that has been storing a massive number of blobs in a single container. They frequently list the objects in the container and then perform actions against a subset of those blobs. They are seeing a performance hit, as the time to retrieve a full listing has been growing.
This might not apply to your scenario, but it's something to consider...
I don't think it really matters (from a scalability/parallelization perspective), because partitioning in Windows Azure blob storage is done at the blob level, not the container level. Reasons to spread out across different containers have more to do with access control (e.g. SAS) or total storage size.
See here for more details: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
(Scroll down to "Partitions").
Quoting:
Blobs – Since the partition key is down to the blob name, we can load
balance access to different blobs across as many servers in order to
scale out access to them. This allows the containers to grow as large
as you need them to (within the storage account space limit). The
tradeoff is that we don’t provide the ability to do atomic
transactions across multiple blobs.
Theoretically speaking, there should be no difference between lots of containers or fewer containers with more blobs. The extra containers can be nice as additional security boundaries (for public anonymous access or different SAS signatures for instance). Extra containers can also make housekeeping a bit easier when pruning (deleting a single container versus targeting each blob). I tend to use more containers for these reasons (not for performance).
Theoretically, the performance impact should not exist. The blob itself (the full URL) is the partition key in Windows Azure (and has been for a long time). That is the smallest unit that will be load-balanced off a partition server. So you could (and often will) have two different blobs in the same container being served by different servers.
Jeremy indicates there is a performance difference between more and fewer containers. I have not dug into those benchmarks enough to explain why that might be the case, but I suspect other factors (size, duration of the test, etc.) explain any discrepancies.
There is one more factor that comes into this: price!
Currently the List and Create Container operations cost the same:
US$0.054 per 10,000 calls
Writing a blob costs the same.
So in the extreme case you can pay a lot more if you create and delete many containers
(deletes are free).
you can see the calculator here:
https://azure.microsoft.com/en-us/pricing/calculator/
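To put the rate above in perspective (prices change over time, so treat the numbers as illustrative only), the container-related cost is easy to estimate:

```python
RATE_PER_10K = 0.054  # US$ per 10,000 billable operations (rate quoted above)

def operation_cost(num_calls: int, rate_per_10k: float = RATE_PER_10K) -> float:
    """Cost in US$ of num_calls billable storage operations."""
    return num_calls / 10_000 * rate_per_10k

# e.g. creating a fresh container every 15 minutes for a year
# (deletes are free, so only the creations are billed):
containers_per_year = 4 * 24 * 365  # 35,040 create-container calls
assert operation_cost(10_000) == 0.054
```

So even an aggressive container-per-bucket design costs well under a dollar a year in container operations; the blob writes themselves dominate the bill.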
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist#partitioning
Understanding how Azure Storage partitions your blob data is useful for enhancing performance. Azure Storage can serve data in a single partition more quickly than data that spans multiple partitions. By naming your blobs appropriately, you can improve the efficiency of read requests.
Blob storage uses a range-based partitioning scheme for scaling and load balancing. Each blob has a partition key comprised of the full blob name (account+container+blob). The partition key is used to partition blob data into ranges. The ranges are then load-balanced across Blob storage.
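Because the partition key is the full blob name, lexically sequential names (e.g. timestamps) tend to land in the same partition range. One commonly suggested mitigation, sketched here with a made-up naming convention, is to prepend a short hash so writes spread across ranges:

```python
import hashlib

def partitioned_name(timestamp_name: str) -> str:
    """Prefix a short, stable hash so lexically adjacent names
    (e.g. '20210601-1200-data.bin') map to different partition ranges."""
    prefix = hashlib.md5(timestamp_name.encode("utf-8")).hexdigest()[:3]
    return f"{prefix}-{timestamp_name}"

# adjacent timestamps get unrelated prefixes, spreading the write load
name_a = partitioned_name("20210601-1200-data.bin")
name_b = partitioned_name("20210601-1201-data.bin")
assert name_a != name_b
```

The trade-off is that prefix listing by time range no longer works directly; you exchange scan convenience for write throughput.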