The Spark documentation section "Recommended settings for writing to object stores" says:
For object stores whose consistency model means that rename-based commits are safe use the FileOutputCommitter v2 algorithm for performance; v1 for safety.
Is it safe to use the v2 algorithm to write out to Google Cloud Storage?
What, exactly, does it mean for the algorithm to be "not safe"? What are the concrete criteria to use to decide whether I am in a situation where v2 is not safe?
Aah. I wrote that bit of the docs, and one of the papers you cite.
GCS implements rename() non-atomically (under the hood it is a copy and delete), so v1 isn't really any more robust than v2, and v2 can be a lot faster.
The Azure "abfs" connector has O(1) atomic renames, so all good there.
S3 has suffered from both performance and safety problems. Now that it is consistent there's less risk, but it's still horribly slow on production datasets. Use a higher-performance committer (the EMR Spark committer, the S3A committers).
Or look at cloud-first table formats like Iceberg, Hudi and Delta Lake. This is where the focus is these days.
Update October 2022
Apache Hadoop 3.3.5 added, in MAPREDUCE-7341, the Intermediate Manifest Committer for correctness, performance and scalability on abfs and gcs (it also works on hdfs, FWIW). It commits tasks by listing the output directory trees of the task attempts and saving the list of files to rename into a manifest file, which is committed atomically. Job commit is then a simple series of steps:
1. List the manifest files to commit, loading them as the list results are paged in.
2. Create the output directory tree.
3. Rename all source files to the destination via a thread pool.
4. Clean up the task attempts, which again can be done in a thread pool for gcs performance.
5. Save the summary to the _SUCCESS json file and, if you want, to another directory. The summary includes statistics on all store IO done during task and job commit.
This is correct for GCS as it relies on a single file rename as the sole atomic operation.
For ABFS it adds support for rate limiting of IOPS and resilience to the way abfs fails when you try a few thousand renames in the same second. One of those examples of a problem which only surfaces in production, not in benchmarking.
This committer ships with Hadoop 3.3.5 and will not be backported; use Hadoop binaries of that or a later version if you want to use it.
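For reference, here is a minimal sketch of how the committer might be wired up for a Spark job writing to gs:// paths. The property and class names are the ones documented for the Hadoop manifest committer and the spark-hadoop-cloud module, but treat this as an outline to verify against your own Hadoop and Spark releases; the app name and bucket are made-up placeholders.

import org.apache.spark.sql.SparkSession;

public class ManifestCommitterExample {
    public static void main(String[] args) {
        // Bind the manifest committer factory to the gs:// scheme and route
        // Spark SQL writes through the Hadoop committer factory mechanism
        // (the last two settings need spark-hadoop-cloud on the classpath).
        SparkSession spark = SparkSession.builder()
            .appName("manifest-committer-demo")   // placeholder name
            .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs",
                "org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory")
            .config("spark.sql.sources.commitProtocolClass",
                "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
            .config("spark.sql.parquet.output.committer.class",
                "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
            .getOrCreate();

        // Any ordinary write now goes through the manifest committer.
        spark.range(10).write().mode("overwrite").parquet("gs://my-bucket/manifest-demo"); // placeholder path
        spark.stop();
    }
}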
From https://databricks.com/blog/2017/05/31/transactional-writes-cloud-storage.html:
We see empirically that while v2 is faster, it also leaves behind partial results on job failures, breaking transactionality requirements. In practice, this means that with chained ETL jobs, a job failure — even if retried successfully — could duplicate some of the input data for downstream jobs. This requires careful management when using chained ETL jobs.
It's safe as long as you manage partial writes on failure. To elaborate, the passage you quote means safe with regard to rename safety. Of Azure, AWS and GCP, only AWS S3 is eventually consistent and thus unsafe to use with the v2 algorithm even when no job failures happen. But GCP (like Azure and AWS) is not safe with regard to partial writes.
FileOutputCommitter V1 vs V2
1. mapreduce.fileoutputcommitter.algorithm.version=1
The AM (application master) does mergePaths() at the end, after all reducers complete.
If the MR job has many reducers, the AM first waits for all reducers to finish and then uses a single thread to merge the output files.
So this algorithm has some performance concerns for large jobs.
2. mapreduce.fileoutputcommitter.algorithm.version=2
Each reducer does mergePaths() to move its output files into the final output directory concurrently.
So this algorithm saves the AM a lot of time when the job is committing.
http://www.openkb.info/2019/04/what-is-difference-between.html
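To make the switch concrete: the algorithm version is just a Hadoop configuration property, which Spark users typically set via the spark.hadoop. prefix. A minimal sketch follows; the app name and output path are placeholders, not anything from the question.

import org.apache.spark.sql.SparkSession;

public class CommitterVersionExample {
    public static void main(String[] args) {
        // version 1: the AM merges all task output into the destination at job
        //            commit (single-threaded, slower).
        // version 2: each task moves its own output into the destination at task
        //            commit (faster, but partial output can be left behind on failure).
        SparkSession spark = SparkSession.builder()
            .appName("committer-v2-demo")   // placeholder
            .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
            .getOrCreate();

        spark.range(100).write().mode("overwrite").parquet("gs://my-bucket/committer-demo"); // placeholder
        spark.stop();
    }
}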
If you look at the Apache Spark documentation, Google Cloud is marked as safe with the v1 version, so it is the same with v2.
What, exactly, does it mean for the algorithm to be "not safe"?
In S3 there is no concept of renaming, so once the data is written to an S3 temp location it is copied again to the new S3 location, whereas Azure and Google cloud stores do have directory renames.
AWS S3 has eventual consistency, meaning that if you delete a bucket and immediately list all buckets, the deleted bucket might still appear in the list. Eventual consistency causes file-not-found exceptions during partial writes, so it is not safe.
What are the concrete criteria to use to decide whether I am in a situation where v2 is not safe?
What is the best practice writing massive amount of files to s3 using Spark
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel
https://spark.apache.org/docs/3.1.1/cloud-integration.html#recommended-settings-for-writing-to-object-stores
https://databricks.com/blog/2017/05/31/transactional-writes-cloud-storage.html
https://github.com/steveloughran/zero-rename-committer/files/1604894/a_zero_rename_committer.pdf
OK, so I have Auto Loader working in directory listing mode, because the event-driven mode requires far more elevated permissions than we can get in LIVE.
So, basically, what Auto Loader does is: read Parquet files iteratively from many different folders in the landing zone (many small files), then write them into a raw container as Delta Lake with schema inference and evolution, create external tables, and run an OPTIMIZE.
That's about it.
My question is: for this workload, what would be the ideal node type (worker and driver) for my cluster in Azure? Meaning, should it be "Compute Optimized", "Storage Optimized" or "Memory Optimized"?
From this link, I could see that "Compute Optimized" would probably be the best choice, but since my job does most of its work reading landing files (many small files) and writing Delta files, checkpoints and schemas, shouldn't "Storage Optimized" be best here?
I plan to try all of them out, but if someone already has pointers, will be appreciated.
By the way, the storage here is Azure data lake gen 2.
If you don't do too many complex aggregations, then I would recommend the "Compute Optimized" or "General Purpose" nodes for that work. The primary load will be reading the data from files, combining it, and then writing to ADLS, so the more CPU power you have, the faster the data processing will be.
Only if you have very many small files (think tens or hundreds of thousands) might you consider a bigger node for the driver, whose role will be identifying the new files in storage.
I am developing a distributed application in Python. The application has two major packages, Package A and Package B, that work separately but communicate with each other through a queue. In other words, Package A generates some files and enqueues (pushes) them to a queue, and Package B dequeues (pops) the files on a first-come-first-served basis and processes them. Both Package A and B are going to be deployed on Google Cloud as Docker containers.
I need to plan the best storage option for keeping the files and the queue; both only need to be stored and used temporarily.
I think my options are Cloud Storage buckets or Google Datastore, but I have no idea how to choose between them or which would be the best option. The best option would be a solution that has low cost, is reliable, and is easy to use from the development perspective.
Any suggestion is welcome... Thanks!
Google Cloud Storage sounds like the right option for you because it supports large files. You have no need for the features provided by Datastore, etc., such as querying by other fields.
If you only need to process a file once, when it is first uploaded, you could use GCS Pub/Sub notifications and trigger your processor from Pub/Sub.
If you need more complex tasks, e.g. one task that dispatches to multiple child tasks which all operate on the same file, then it's probably better to use a separate task system like Celery and pass the GCS URL in the task definition.
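To illustrate the notification route, here is a rough sketch of a worker that consumes GCS object-change notifications from a Pub/Sub subscription. The project ID, subscription name and the processing step are placeholders, and the snippet is in Java purely for illustration; the Python Pub/Sub client follows the same receive-and-ack pattern.

import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class GcsNotificationWorker {
    public static void main(String[] args) {
        // Placeholder project and subscription; the subscription is assumed to be
        // attached to the topic that the bucket's GCS notifications publish to.
        ProjectSubscriptionName subscription =
            ProjectSubscriptionName.of("my-project", "file-events-sub");

        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
            // GCS notifications carry the bucket and object name as message attributes.
            String bucket = message.getAttributesOrDefault("bucketId", "");
            String object = message.getAttributesOrDefault("objectId", "");
            System.out.println("New file to process: gs://" + bucket + "/" + object);
            // ... download the object and run Package B's processing here ...
            consumer.ack();
        };

        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
        subscriber.startAsync().awaitRunning();
        subscriber.awaitTerminated(); // block; a real worker would handle shutdown cleanly
    }
}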
Background
The problem we're facing is that we are doing video encoding and want to distribute the load to multiple nodes in the cluster.
We would like to constrain the number of video encoding jobs on a particular node to some maximum value. We would also like to have small video encoding jobs sent to a certain grouping of nodes in the cluster, and long video encoding jobs sent to another grouping of nodes in the cluster.
The idea behind this is to help maintain fairness amongst clients by partitioning the large jobs into a separate pool of nodes. This helps ensure that the small video encoding jobs are not blocked / throttled by a single tenant running a long encoding job.
Using Service Fabric
We plan on using an ASF service for the video encoding. With this in mind we had an idea of dynamically creating a service for each job that comes in. Placement constraints could then be used to determine which pool of nodes a job would run in. Custom metrics based on memory usage, CPU usage ... could be used to limit the number of active jobs on a node.
With this method the node distributing the jobs would have to poll whether a new service could currently be created that satisfies the placement constraints and metrics.
Questions
What happens when a service can't be placed on a node? (Using CreateServiceAsync I assume?)
Will this polling be prohibitively expensive?
Our video encoding executable is packaged along with the service which is approximately 80MB. Will this make the spinning up of a new service take a long time? (Minutes vs seconds)
As an alternative to this we could use a reliable queue based system, where the large jobs pool pulls from one queue and the small jobs pool pulls from another queue. This seems like the simpler way, but I want to explore all options to make sure I'm not missing out on some of the features of Service Fabric. Is there another better way you would suggest?
I have no experience with placement constraints and dynamic services, so I can't speak to that.
The polling of the perf counters isn't terribly expensive; that being said, it's not a free operation. A one-second poll interval shouldn't cause any huge perf impact while still providing a decent degree of resolution.
The service packages get copied to each node at deployment time rather than when services get spun up, so it'll make the deployment a bit slower but not affect service creation.
You're going to want to put the job data in reliable collections any way you structure it, but the question is how. One idea I just had that might be worth considering is making the job processing service a partitioned service and basing your partitioning strategy on encoding job size and/or tenant, so that large jobs from the same tenant get stuck in the same queue and smaller jobs for others go elsewhere.
As an aside, one thing I've dealt with in the past is that SF remoting limits the size of the messages sent and throws if a message is too big, so if your video files are being passed from service to service you're going to want to consider a paging strategy for inter-service communication.
I have many images that I need to run through a java program to create more image files -- an embarrassingly parallel case. Each input file is about 500 mb, needs about 4 GB of memory during processing, and takes 30 seconds to 2 minutes to run. The java program is multithreaded but more gain comes from parallelizing on the input files than from using more threads. I need to kick off processes several times a day (I do not want to turn on/off the cluster manually nor pay for it 24/7).
I'm a bit lost in the variety of cloud options out there:
AWS Lambda has insufficient system resources (not enough memory).
With Google Cloud Dataflow, it appears that I would have to write my own pipeline source to use their Cloud Storage buckets. Fine, but I don't want to waste time doing that if it's not an appropriate solution (which it might be, I can't tell yet).
Amazon Data Pipeline looks to be the equivalent of Google Cloud Dataflow. (Added in edit for completeness.)
Google Cloud Dataproc: this is not a MapReduce/Hadoop-y situation, but it might work nonetheless. I'd rather not manage my own cluster, though.
Google Compute Engine or AWS with autoscaling, where I just kick off processes for each core on the machine. More management for me, but no APIs to learn.
Microsoft Data Lake is not released yet and looks hadoop-y.
Microsoft Batch seems quite appropriate (but I'm asking because I remain curious about other options).
Can anyone advise what appropriate solution(s) would be for this?
You should be able to do this with Dataflow quite easily. The pipeline could look something like (assuming your files are located on Google Cloud Storage, GCS):
class ImageProcessor {
public static void process(GcsPath path) {
// Open the image, do the processing you want, write
// the output to where you want.
// You can use GcsUtil.open() and GcsUtil.create() for
// reading and writing paths on GCS.
}
}
// This will work fine until a few tens of thousands of files.
// If you have more, let me know.
List<GcsPath> filesToProcess = GcsUtil.expand(GcsPath.fromUri("..."));
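// Assumption for this sketch: 'p' is a Pipeline created earlier from your options,
// e.g. Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());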
p.apply(Create.of(filesToProcess))
.apply(MapElements.via(ImageProcessor::process)
.withOutputType(new TypeDescriptor<Void>() {}));
p.run();
This is one of the common family of cases where Dataflow is used as an embarrassingly-parallel orchestration framework rather than a data processing framework, but it should work.
You will need Dataflow SDK 1.2.0 to use the MapElements transform (support for Java 8 lambdas is new in 1.2.0).
A few questions regarding the HDInsight jobs approach.
1) How do I schedule an HDInsight job? Is there any ready-made solution for it? For example, if my system will constantly collect a large number of new input files that we need to run MapReduce jobs on, what is the recommended way to implement ongoing processing?
2) From the price perspective, it is recommended to remove the HDInsight cluster when there is no job running. As I understand it, there is no way to automate this process if we decide to run the job daily? Any recommendations here?
3) Is there a way to ensure that the same files are not processed more than once? How do you solve this issue?
4) I might be mistaken, but it looks like every HDInsight job requires a new output storage folder to store the reducer results in. What is the best practice for merging those results so that reporting always works on the whole data set?
OK, there are a lot of questions in there! Here are, I hope, a few quick answers.
There isn't really a way of scheduling job submission in HDInsight, though of course you can schedule a program to run the job submissions for you. Depending on your workflow, it may be worth taking a look at Oozie, which can be a little awkward to get going on HDInsight, but should help.
On the price front, I would recommend that if you're not using the cluster, you destroy it and bring it back again when you need it (those compute hours can really add up!). Note that this will lose anything you have in HDFS, which should be mainly intermediate results; any output or input data held in the asv storage will persist in an Azure Storage account. You can certainly automate this by using the CLI tools, or the REST interface used by the CLI tools (see my answer on Hadoop on Azure Create New Cluster; the first one is out of date).
I would do this by making sure I only submitted the job once for each file, and relying on Hadoop to handle the retry and reliability side, removing the need to manage any retries in your application.
Once you have the outputs from your initial processes, if you want to reduce them to a single output for reporting the best bet is probably a secondary MapReduce job with the outputs as its inputs.
If you don't care about the individual intermediate jobs, you can just chain these directly in the one MapReduce job (which can contain as many map and reduce steps as you like) through job chaining; see Chaining multiple MapReduce jobs in Hadoop for a Java-based example. Sadly the .NET API does not currently support this form of job chaining.
However, you may be able to just use the ReducerCombinerBase class if your case allows for a Reducer->Combiner approach.
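For the Java-based chaining mentioned above, here is a minimal sketch of what driving two MapReduce jobs back to back can look like; the job names and argument layout are placeholders, and the mapper/reducer classes are left at their identity defaults, so substitute your own.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // scratch location for the first job's output
        Path output = new Path(args[2]);

        // First job: the regular processing pass.
        // (Set your own mapper/reducer classes here; the defaults are identity.)
        Job first = Job.getInstance(conf, "initial processing");
        first.setJarByClass(ChainedJobs.class);
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) {
            System.exit(1);
        }

        // Second job: takes the first job's output directory as its input and
        // merges everything into the single dataset used for reporting.
        Job second = Job.getInstance(conf, "merge for reporting");
        second.setJarByClass(ChainedJobs.class);
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}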