What kind of nodes to choose for Auto Loader on Azure

OK, so I have Auto Loader working in directory listing mode, because the event-driven (file notification) mode requires elevated permissions that we can't get in LIVE.
So, basically, what the Auto Loader job does is: it iteratively reads Parquet files from many different folders in the landing zone (many small files), writes them into a raw container as Delta Lake with schema inference and evolution, creates external tables, and runs an OPTIMIZE.
That's about it.
My question is: for this workload, what would be the ideal node type (worker and driver) for my cluster in Azure? Should it be "Compute Optimized", "Storage Optimized" or "Memory Optimized"?
From this link, I could see that "Compute Optimized" would probably be the best choice, but since my job spends most of its time reading landing files (many small files) and writing Delta files, checkpoints and schemas, shouldn't "Storage Optimized" be best here?
I plan to try all of them out, but if someone already has pointers, will be appreciated.
By the way, the storage here is Azure data lake gen 2.

If you don't do too many complex aggregations, then I would recommend "Compute Optimized" or "General Purpose" nodes for this work - the primary load will be reading data from files, combining it, and writing it to ADLS, so the more CPU power you have, the faster the data processing will be.
Only if you have very many small files (think tens or hundreds of thousands) should you consider a bigger node for the driver, whose role is identifying the new files in storage.
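For reference, here is a minimal sketch of the kind of directory-listing Auto Loader job described in the question - the paths, container names, table name, and option choices (schema evolution mode, trigger) are placeholders and assumptions, not the asker's actual settings:

```python
# A sketch only: directory-listing Auto Loader reading Parquet from a landing
# zone and writing Delta with schema inference/evolution. All paths, names and
# option choices below are placeholders/assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided on Databricks

landing = "abfss://landing@mystorageacct.dfs.core.windows.net/source_a"
raw = "abfss://raw@mystorageacct.dfs.core.windows.net/source_a"
chk = "abfss://raw@mystorageacct.dfs.core.windows.net/_checkpoints/source_a"

stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.useNotifications", "false")         # directory-listing mode (the default)
    .option("cloudFiles.schemaLocation", chk)               # schema inference + tracking
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load(landing)
)

(
    stream.writeStream.format("delta")
    .option("checkpointLocation", chk)
    .option("mergeSchema", "true")                          # allow evolved columns on write
    .trigger(availableNow=True)                             # process the backlog, then stop
    .start(raw)
    .awaitTermination()
)

# External table + OPTIMIZE once the batch finishes
spark.sql(f"CREATE TABLE IF NOT EXISTS raw_db.source_a USING DELTA LOCATION '{raw}'")
spark.sql("OPTIMIZE raw_db.source_a")
```

Note that in this mode the file discovery (the directory listing) happens on the driver, which is why the driver size matters more than usual when there are very many small files.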

Related

Writing to Google Cloud Storage with v2 algorithm safe?

Recommended settings for writing to object stores says:
For object stores whose consistency model means that rename-based commits are safe use the FileOutputCommitter v2 algorithm for performance; v1 for safety.
Is it safe to use the v2 algorithm to write out to Google Cloud Storage?
What, exactly, does it mean for the algorithm to be "not safe"? What are the concrete set of criteria to use to decide if I am in a situation where v2 is not safe?
aah. I wrote that bit of the docs. And one of the papers you cite.
GCP implements rename() non-atomically, so v1 isn't really any more robust than v2. And v2 can be a lot faster.
Azure "abfs" connector has O(1) Atomic renames, all good.
S3 has suffered from both performance and safety problems. As it is now consistent, there's less risk, but it's still horribly slow on production datasets. Use a higher-performance committer (EMR Spark committer, S3A committers).
Or look at cloud-first formats like: Iceberg, Hudi, Delta Lake. This is where the focus is these days.
Update October 2022
Apache Hadoop 3.3.5 added, in MAPREDUCE-7341, the Intermediate Manifest Committer for correctness, performance and scalability on abfs and gcs (it also works on hdfs, FWIW). It commits tasks by listing the output directory trees of task attempts and saving the list of files to rename to a manifest file, which is committed atomically. Job commit is a simple series of steps:
1. List the manifest files to commit, loading them as the list results are paged in.
2. Create the output directory tree.
3. Rename all source files to the destination via a thread pool.
4. Clean up task attempts, which again can be done in a thread pool for gcs performance.
5. Save the summary to the _SUCCESS JSON file and, if you want, to another directory. The summary includes statistics on all store IO done during task and job commit.
This is correct for GCS as it relies on a single file rename as the sole atomic operation.
For ABFS it adds support for rate limiting of IOPS and resilience to the way abfs fails when you try a few thousand renames in the same second. One of those examples of a problem which only surfaces in production, not in benchmarking.
This committer ships with Hadoop 3.3.5 and will not be backported - use Hadoop binaries of this or a later version if you want to use it.
https://databricks.com/blog/2017/05/31/transactional-writes-cloud-storage.html
We see empirically that while v2 is faster, it also leaves behind partial results on job failures, breaking transactionality requirements. In practice, this means that with chained ETL jobs, a job failure — even if retried successfully — could duplicate some of the input data for downstream jobs. This requires careful management when using chained ETL jobs.
It's safe as long as you manage partial writes on failure. To elaborate: they mean safe with regard to rename safety in the part you quote. Of Azure, AWS and GCP, only AWS S3 is eventually consistent and unsafe to use with the V2 algorithm even when no job failures happen. But none of GCP, Azure or AWS is safe with regard to partial writes.
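One practical way to handle the "partial results on job failures" caveat on the read side is to have downstream jobs refuse to consume an output directory unless the committer's _SUCCESS marker exists. This only guards against reading a half-committed output; it does not remove files left behind by a failed attempt, which you typically handle by clearing the destination before retrying. A rough PySpark sketch (the output path is a placeholder):

```python
# Downstream guard: only read a directory whose job commit actually completed.
# The committer writes _SUCCESS at job commit; a failed v2 job can leave files
# behind but no marker. The output path below is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

output_dir = "gs://my-bucket/etl/stage1"

marker = sc._jvm.org.apache.hadoop.fs.Path(output_dir + "/_SUCCESS")
fs = marker.getFileSystem(sc._jsc.hadoopConfiguration())

if not fs.exists(marker):
    raise RuntimeError(f"{output_dir} has no _SUCCESS marker; refusing to read possibly partial output")

df = spark.read.parquet(output_dir)
```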
FileOutputCommitter V1 vs V2
1. mapreduce.fileoutputcommitter.algorithm.version=1
AM will do mergePaths() in the end after all reducers complete.
If this MR job has many reducers, the AM will first wait for all reducers to finish and then use a single thread to merge the output files.
So this algorithm has some performance concern for large jobs.
2. mapreduce.fileoutputcommitter.algorithm.version=2
Each Reducer will do mergePaths() to move their output files into the final output directory concurrently.
So this algorithm saves a lot of time for the AM when the job is committing.
http://www.openkb.info/2019/04/what-is-difference-between.html
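For completeness, the algorithm version is an ordinary Hadoop property, so from Spark it can be set with the spark.hadoop. prefix. A minimal sketch (choose 1 or 2 according to the trade-offs above):

```python
# Choosing the FileOutputCommitter algorithm from Spark: the property is a plain
# Hadoop setting, passed through with the spark.hadoop. prefix.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("committer-demo")
    # v1: the driver/AM merges task output at job commit (slower, safer)
    # v2: tasks move files into the destination at task commit (faster, but can
    #     leave partial results behind if the job fails)
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
    .getOrCreate()
)
```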
If you look at the Apache Spark documentation, Google Cloud is marked as safe for the v1 algorithm, so it is the same for v2.
What, exactly, does it mean for the algorithm to be "not safe"?
In S3 there is no concept of renaming, so once the data is written to an S3 temp location it is copied again to the new S3 location, but the Azure and Google Cloud stores do have directory renames.
AWS S3 has eventual consistency, meaning that if you delete a bucket and immediately list all buckets, the deleted bucket might still appear in the list. Eventual consistency can cause file-not-found exceptions during partial writes, so it's not safe.
What are the concrete set of criteria to use to decide if I am in a situation where v2 is not safe?
What is the best practice writing massive amount of files to s3 using Spark
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel
https://spark.apache.org/docs/3.1.1/cloud-integration.html#recommended-settings-for-writing-to-object-stores
https://databricks.com/blog/2017/05/31/transactional-writes-cloud-storage.html
https://github.com/steveloughran/zero-rename-committer/files/1604894/a_zero_rename_committer.pdf

How to effectively share data between scale set VMs

We have an application which is quite scalable as it is. Basically you have one or more stateless nodes that all do some independent work on files that are read from and written to a shared NFS share.
This NFS can be a bottleneck, but with local deployments customers just buy a big enough box to have sufficient performance.
Now we are moving this to Azure and I would like a better, more "cloudy" way of sharing data :) and running Linux NFS servers isn't an ideal scenario if we need to manage them.
Is the Azure Blob storage the right tool for this job (https://azure.microsoft.com/en-us/services/storage/blobs/)?
we need good scalability, e.g. up to 10k files written in a minute
files are quite small, less than 50KB per file on average
files created and read, not changed
files are short lived, we purge them every day
I am looking for more practical experience with this kind of storage and how good it really is.
There are two possible solutions to your request, either using Azure Storage Blobs (Recommended for your scenario) or Azure Files.
Azure Blobs:
It can't be attached to a server and mounted like a network share.
Blobs do not support a hierarchical file structure beyond containers (virtual folders can be accessed, but the con is that you can't delete a container if it contains blobs - relevant to your point about purging - though there are ways to do the purging in your own code; a sketch follows the links below).
Azure Files:
Links recommended:
Comparison between Azure Files and Blobs: https://learn.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks
Informative SO post here
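To make the write-then-purge pattern concrete, here is a rough sketch with the azure-storage-blob Python SDK - the connection string, container name and prefix layout are assumptions for illustration only:

```python
# Write small blobs under a date prefix, then purge yesterday's prefix once a day.
# Connection string, container name and naming scheme are placeholders.
from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("work-files")

# Producer side: name blobs by arrival time so they can be purged by prefix later.
now = datetime.now(timezone.utc)
blob_name = f"{now:%Y-%m-%d}/{now:%H%M%S%f}-data.bin"
container.upload_blob(blob_name, b"...payload...", overwrite=True)

# Daily purge: delete everything under yesterday's date prefix.
yesterday = (now - timedelta(days=1)).strftime("%Y-%m-%d")
for blob in container.list_blobs(name_starts_with=yesterday + "/"):
    container.delete_blob(blob.name)
```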

HDInsight vs. Virtualized Hadoop Cluster on Azure

I'm investigating two alternatives for using a Hadoop cluster: the first is using HDInsight (with either Blob or HDFS storage) and the second is deploying a powerful Windows Server on Microsoft Azure and running HDP (Hortonworks Data Platform) on it (using virtualization). The second alternative gives me more flexibility, but what I'm interested in is the overhead of each alternative. Any ideas on that? In particular, what is the effect of Blob storage on efficiency?
This is a pretty broad question, so an answer of "it depends," is appropriate here. When I talk with customers, this is how I see them making the tradeoff. It's a spectrum of control at one end, and convenience on the other. Do you have specific requirements on which Linux distro or Hadoop distro you deploy? Then you will want to go with IaaS and simply deploy there. That's great, you get a lot of control, but patching and operations are still your responsibility.
We refer to HDInsight as a managed service, and what we mean by that is that we take care of running it for you (eg, there is an SLA we provide on the cluster itself, and the apps running on it, not just "can I ping the vm"). We operate that cluster, patch the OS, patch Hadoop, etc. So, lots of convenience there, but, we don't let you choose which Linux distro or allow you to have an arbitrary set of Hadoop bits there.
From a perf perspective, HDInsight can deploy on any Azure node size, similar to IaaS VMs (this is a new feature launched this week). On the question of Blob efficiency, you should try both out and see what you think. The nice part about Blob store is that you get more economic flexibility: you can deploy a small cluster against a massive volume of data if that cluster only needs to work on a small chunk of it (as compared to putting it all in HDFS, where you need all of the nodes running all of the time to hold all of your data).

Azure Table Storage, Cassandra or HBase for caching

I am developing a site with a large number of users and a lot of content, so I need to cache some data about them (in a format like a string key plus a serialized array or columns - millions of rows). I am looking at Windows Azure Table Storage, Cassandra and HBase.
The plus of Azure Table Storage: scalability, and not having to think about disk space, the number of folders and files per folder, etc. The minus: I don't yet understand how fast access to it will be over REST (request/response time), since it isn't local file access.
The plus of Cassandra or HBase: they are proven systems. The minuses: they have to be deployed and tuned; I don't yet understand what performance they provide for a given data size and number of requests; they will take up a lot of disk space on the VM that hosts the site, so I'd have to watch and scale it, and on Azure scaling VM storage is not a one-button operation.
Please recommend what system to choose and why.
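To make the Table Storage option concrete, this is roughly what the string-key / serialized-value pattern described above looks like with the azure-data-tables Python SDK - the connection string, table name and partitioning scheme are assumptions for illustration, not a recommendation:

```python
# Cache pattern on Azure Table Storage: string key -> serialized payload.
# Connection string, table name and partitioning scheme are placeholders.
import json
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")
table = service.create_table_if_not_exists("usercache")

def cache_put(user_id: str, payload: dict) -> None:
    table.upsert_entity({
        "PartitionKey": user_id[:2],    # spread rows across partitions
        "RowKey": user_id,
        "Data": json.dumps(payload),    # the serialized array/columns as a string
    })

def cache_get(user_id: str) -> dict:
    entity = table.get_entity(partition_key=user_id[:2], row_key=user_id)
    return json.loads(entity["Data"])

cache_put("user12345", {"name": "Alice", "scores": [1, 2, 3]})
print(cache_get("user12345"))
```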

Is it better to have many small Azure storage blob containers (each with some blobs) or one really large container with tons of blobs?

So the scenario is the following:
I have multiple instances of a web service that write a blob of data to Azure Storage. I need to be able to group blobs into a container (or a virtual directory) depending on when they were received. Once in a while (every day at the worst) older blobs will get processed and then deleted.
I have two options:
Option 1
I make one container called "blobs" (for example) and then store all the blobs in that container. Each blob will use a directory-style name, with the directory name being the time it was received (e.g. "hr0min0/data.bin", "hr0min0/data2.bin", "hr0min30/data3.bin", "hr1min45/data.bin", ... , "hr23min0/dataN.bin", etc - a new directory every X minutes). The thing that processes these blobs will process hr0min0 blobs first, then hr0minX and so on (and the blobs are still being written while being processed).
Option 2
I have many containers, each with a name based on the arrival time (so the first will be a container called blobs_hr0min0, then blobs_hr0minX, etc.), and all the blobs in a container are those that arrived at the named time. The thing that processes these blobs will process one container at a time.
So my question is, which option is better? Does option 2 give me better parallelization (since containers can be on different servers) or is option 1 better because many containers can cause other unknown issues?
Everyone has given you excellent answers around accessing blobs directly. However, if you need to list blobs in a container, you will likely see better performance with the many-container model. I just talked with a company who's been storing a massive number of blobs in a single container. They frequently list the objects in the container and then perform actions against a subset of those blobs. They're seeing a performance hit, as the time to retrieve a full listing has been growing.
This might not apply to your scenario, but it's something to consider...
I don't think it really matters (from a scalability/parallelization perspective), because partitioning in Windows Azure blob storage is done at the blob level, not the container level. Reasons to spread out across different containers have more to do with access control (e.g. SAS) or total storage size.
See here for more details: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
(Scroll down to "Partitions").
Quoting:
Blobs – Since the partition key is down to the blob name, we can load balance access to different blobs across as many servers in order to scale out access to them. This allows the containers to grow as large as you need them to (within the storage account space limit). The tradeoff is that we don’t provide the ability to do atomic transactions across multiple blobs.
Theoretically speaking, there should be no difference between lots of containers or fewer containers with more blobs. The extra containers can be nice as additional security boundaries (for public anonymous access or different SAS signatures for instance). Extra containers can also make housekeeping a bit easier when pruning (deleting a single container versus targeting each blob). I tend to use more containers for these reasons (not for performance).
Theoretically, the performance impact should not exist. The blob itself (full URL) is the partition key in Windows Azure (has been for a long time). That is the smallest thing that will be load-balanced from a partition server. So, you could (and often will) have two different blobs in same container being served out by different servers.
Jeremy indicates there is a performance difference between more and fewer containers. I have not dug into those benchmarks enough to explain why that might be the case, but I would suspect other factors (like size, duration of test, etc.) to explain any discrepancies.
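The housekeeping point is easy to see in code: with per-period containers (Option 2) pruning is one call, while with a single container (Option 1) you enumerate and delete blobs under a virtual-directory prefix. A sketch with the azure-storage-blob SDK (names are placeholders; also note container names only allow lowercase letters, digits and hyphens, so the underscores in Option 2's example names would need to become hyphens):

```python
# Pruning old data under the two layouts. Connection string and names are
# placeholders for illustration.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")

# Option 2: one container per period, so pruning a period is a single call.
service.delete_container("blobs-hr0min0")

# Option 1: one big container, prune by virtual-directory prefix.
container = service.get_container_client("blobs")
for blob in container.list_blobs(name_starts_with="hr0min0/"):
    container.delete_blob(blob.name)
```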
There is also one more factor that comes into this: price!
Currently the List and Create Container operations are the same price:
US$0.054 per 10,000 calls
Writing a blob is actually the same price.
So in the extreme case you can pay a lot more if you create and delete many containers
(delete is free)
You can see the calculator here:
https://azure.microsoft.com/en-us/pricing/calculator/
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist#partitioning
Understanding how Azure Storage partitions your blob data is useful for enhancing performance. Azure Storage can serve data in a single partition more quickly than data that spans multiple partitions. By naming your blobs appropriately, you can improve the efficiency of read requests.
Blob storage uses a range-based partitioning scheme for scaling and load balancing. Each blob has a partition key comprised of the full blob name (account+container+blob). The partition key is used to partition blob data into ranges. The ranges are then load-balanced across Blob storage.
