Running HDInsight jobs howto - azure

A few questions regarding the approach to HDInsight jobs.
1) How do I schedule an HDInsight job? Is there a ready-made solution for it? For example, if my system constantly receives a large number of new input files that we need to run a map/reduce job on, what is the recommended way to implement ongoing processing?
2) From a price perspective, it is recommended to remove the HDInsight cluster while no job is running. As I understand it, there is no way to automate this process if we decide to run the job daily. Any recommendations here?
3) Is there a way to ensure that the same files are not processed more than once? How do you solve this issue?
4) I might be mistaken, but it looks like every HDInsight job requires a new output storage folder to store reducer results in. What is the best practice for merging those results so that reporting always works on the whole data set?

OK, there are a lot of questions in there! Here are, I hope, a few quick answers.
There isn't really a way of scheduling job submission in HDInsight, though of course you can schedule a program to run the job submissions for you. Depending on your workflow, it may be worth taking a look at Oozie, which can be a little awkward to get going on HDInsight, but should help.
On the price front, I would recommend that if you're not using the cluster, you destroy it and bring it back when you need it (those compute hours can really add up!). Note that this will lose anything you have in HDFS, which should be mainly intermediate results; any output or input data held in ASV storage will persist in an Azure Storage account. You can certainly automate this by using the CLI tools, or the REST interface used by the CLI tools (see my answer on Hadoop on Azure Create New Cluster; the first one is out of date).
I would do this by making sure I only submitted the job once for each file, and rely on Hadoop to handle the retry and reliability side, so removing the need to manage any retries in your application.
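One way to make sure each file is submitted only once is to keep a small ledger of the file names that have already been handed to a job. The sketch below is only an illustration under assumptions: the ledger is a local JSON file, and `load_ledger`, `select_new_files` and `mark_submitted` are hypothetical helper names, not part of any HDInsight SDK. In practice the ledger could just as well live in a table or database that survives the cluster being deleted.

```python
import json
from pathlib import Path


def load_ledger(ledger_path):
    """Return the set of file names already submitted (empty if no ledger yet)."""
    path = Path(ledger_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def select_new_files(candidates, ledger_path):
    """Return the candidates that are not in the ledger, preserving their order."""
    seen = load_ledger(ledger_path)
    return [name for name in candidates if name not in seen]


def mark_submitted(names, ledger_path):
    """Record names in the ledger so that later runs skip them."""
    seen = load_ledger(ledger_path) | set(names)
    Path(ledger_path).write_text(json.dumps(sorted(seen)))
```

A scheduled program would call `select_new_files` on the current input listing, submit one job for the result, and call `mark_submitted` only after the submission was accepted.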
Once you have the outputs from your initial processes, if you want to reduce them to a single output for reporting the best bet is probably a secondary MapReduce job with the outputs as its inputs.
If you don't care about the individual intermediate jobs, you can chain these directly in one MapReduce job (which can contain as many map and reduce steps as you like) through job chaining; see Chaining multiple MapReduce jobs in Hadoop for a Java-based example. Sadly, the .NET API does not currently support this form of job chaining.
However, you may be able to just use the ReducerCombinerBase class if your case allows for a Reducer->Combiner approach.
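To illustrate what the secondary "merge" step does, here is a local Python sketch of the reduce logic, not a Hadoop job. It assumes each earlier job wrote lines of the form `key<TAB>count` and that the values can simply be summed; `merge_outputs` is a hypothetical name, and your real reducer output format may differ.

```python
from collections import Counter


def merge_outputs(part_line_iterables):
    """Sum the counts per key across the outputs of several earlier jobs."""
    totals = Counter()
    for lines in part_line_iterables:
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            key, value = line.split("\t", 1)
            totals[key] += int(value)
    return dict(totals)
```

In a real secondary MapReduce job the mapper would pass each `key<TAB>count` pair through unchanged and the reducer would do the summing shown here.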


Databricks REST API throttling and capacity restrictions/limits

I've scaled up the hardware on an azure-databricks cluster ("all-purpose" cluster) appropriately so that it should handle a very large amount of work. The application is designed in a way where incoming data is processed in smallish, discrete chunks. The jobs run in ~20 to 30 seconds. But there is a high degree of concurrent jobs that need to execute at the same time (eg. anywhere from 0 to 50 simultaneous jobs).
The only approach for delivering jobs to the cluster seems to be by way of their REST API in Azure Databricks.
Everything behaves normally until the number of concurrent jobs reaches 10 or so. At that point I see an unreasonable deterioration in throughput. But if I check ganglia or custom telemetry, there appears to be no reason for the deteriorated performance.
My suspicion is that the REST API itself is introducing an artificial bottleneck and they are throttling the number of jobs I can send over to my cluster. This was not self-evident to me. If I am paying for a large cluster, I should be allowed to send jobs to it. The REST API seems to be doing little more than serving as a communication channel that allows me to transmit my requests to my cluster. That API is the last place I would expect to find a resource bottleneck. A Spark developer would naturally investigate their code, then the cluster hardware. The REST API is not a reasonable place for Databricks to be introducing some additional, secretive limitations.
Does anyone know of another way to transmit distinct jobs to a cluster without going thru the REST API? Eg. is there a way for the driver node in the cluster to spawn additional/distinct/first-class jobs without being counted against our REST API allowance?
This issue seems silly and artificial. The secretive nature of these limits is bothersome to me as well. If they are throttling the REST API then there should be a warning, error, or ganglia chart for that. Otherwise developers will struggle with the performance issues using trial and error and guesswork.
Any help is appreciated. I'd prefer not to go all the way back to the drawing board, because of an artificial restriction in their REST API (one that was probably put in place to protect an underpowered "control plane").
Spark is awesome, but it isn't designed to be a high-concurrency database. The folks at Databricks have done a lot to lift the concurrency limitations of Spark, but it still isn't a high-concurrency solution.
In other words, your problem isn't the REST API ... it's the Spark engine in Databricks.
I know you don't want to go back to the drawing board, but the choices here are all bad ones:
You can run multiple Databricks clusters and use NGINX or some other load balancer to distribute the API requests. This will get expensive quickly, but will avoid a redesign.
If your use case supports it, try using a real-time database that supports high concurrency. I like Druid (a managed version is also available), but there are others in the same category.

Keep track of all the parameters of spark-submit

I have a team where many members have permission to submit Spark tasks to YARN (the resource manager) from the command line. It's hard to track who is using how many cores, who is using how much memory, and so on. I'm looking for software, a framework, or something else that could help me monitor the parameters each member used. It would be a bridge between the client and YARN, and I could then use it to filter the submit commands.
I did take a look at MLflow, and I really like MLflow Tracking, but it was designed for the ML training process. Is there an alternative for my purpose, or any other solution to the problem?
Thank you!
My recommendation would be to build such a tool yourself, as it's not too complicated.
Have a wrapper script around spark-submit that logs the usage in a DB; after the Spark job finishes, the wrapper will know to mark that usage as released. This could be done really easily.
In addition, you can even block new spark-submits if your team has already asked for too many resources.
And since you build it yourself, it's really flexible: you can even create "sub teams" or anything you want.
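A minimal sketch of such a wrapper, assuming SQLite as the DB and tracking only a handful of standard spark-submit flags. The table layout and the names `extract_resources`, `open_log` and `run_tracked` are made up for illustration; the `runner` parameter exists only so the sketch can be exercised without a Spark installation.

```python
import getpass
import json
import sqlite3
import subprocess
import time

# spark-submit resource flags to record; extend this tuple as needed.
TRACKED_FLAGS = ("--num-executors", "--executor-cores",
                 "--executor-memory", "--driver-memory")


def extract_resources(argv):
    """Collect tracked flag values given as '--flag value' or '--flag=value'."""
    found = {}
    for i, arg in enumerate(argv):
        name, sep, value = arg.partition("=")
        if name in TRACKED_FLAGS:
            if sep:
                found[name] = value
            elif i + 1 < len(argv):
                found[name] = argv[i + 1]
    return found


def open_log(db_path):
    """Open the usage DB and create the (hypothetical) submissions table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS submissions ("
        "id INTEGER PRIMARY KEY, user TEXT, started REAL, finished REAL, "
        "exit_code INTEGER, resources TEXT, command TEXT)")
    return conn


def run_tracked(argv, conn, user=None, runner=subprocess.call):
    """Log the submission, run spark-submit, then record when it finished."""
    resources = extract_resources(argv)
    cur = conn.execute(
        "INSERT INTO submissions (user, started, resources, command) "
        "VALUES (?, ?, ?, ?)",
        (user or getpass.getuser(), time.time(),
         json.dumps(resources), " ".join(argv)))
    conn.commit()
    exit_code = runner(["spark-submit", *argv])
    # Rows with a NULL 'finished' column are the jobs still holding resources.
    conn.execute(
        "UPDATE submissions SET finished = ?, exit_code = ? WHERE id = ?",
        (time.time(), exit_code, cur.lastrowid))
    conn.commit()
    return exit_code
```

Team members would call the wrapper instead of spark-submit, for example via a small script that runs `run_tracked(sys.argv[1:], open_log("spark_usage.db"))`. Blocking over-quota submissions could be added by summing the unfinished rows before calling `runner`.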

How to execute parallel computing between several instances in Google Cloud Compute Engine?

I've recently run into a problem processing an 8-gigabyte pickle file with a Python script on VMs in Google Cloud Compute Engine. The processing takes too long, and I am looking for ways to reduce the processing time. One possible solution could be to split the processes in the script, or map them across the CPUs of several VMs. If somebody knows how to do this, please share!
You can use Clusters for Large-scale Technical Computing in the Google Cloud Platform (GCP). Open source software such as ElastiCluster provides cluster management and support for provisioning nodes on Google Compute Engine (GCE).
After the cluster is operational, a workload manager manages the task execution and node allocation. There are a variety of popular commercial and open source workload managers, such as HTCondor from the University of Wisconsin, Slurm from SchedMD, Univa Grid Engine, and LSF Symphony from IBM.
This article is also helpful.
It looks like an HPC problem. Look at this link:
There are a lot of valuable solutions to your problem, but the right one depends on the details of your case. A simple first approach could be to logically split your task into small jobs. Then you can assign a subset of these jobs to each GCE instance in your group of dedicated instances.
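The "split, then assign a subset to each instance" step can be sketched as a simple round-robin partition. This is only one possible assignment scheme; `assign_jobs` is a hypothetical helper, and how each instance learns its own index (for example from instance metadata or its name) is left open.

```python
def assign_jobs(jobs, instance_count):
    """Distribute jobs round-robin so each instance gets a near-equal share."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    buckets = [[] for _ in range(instance_count)]
    for index, job in enumerate(jobs):
        buckets[index % instance_count].append(job)
    return buckets
```

Each instance would then process only `buckets[its_index]`. Round-robin works best when the jobs take roughly equal time; a shared queue that instances pull from is the usual alternative when they do not.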
You can consider creating a group with a predefined number of instances. Each instance could rely on a startup script to fetch the job it must execute. When the job finishes, the instance can be deleted and replaced by a new one (Google Compute Engine Managed Instance Groups will create a new instance automatically). You only need to manage when the group should start and stop.
Furthermore, you can consider preemptible instances (cheaper).
Hope this helps you.

Dynamic Service Creation to Distribute Load

The problem we're facing is that we are doing video encoding and want to distribute the load to multiple nodes in the cluster.
We would like to constrain the number of video encoding jobs on a particular node to some maximum value. We would also like to have small video encoding jobs sent to a certain grouping of nodes in the cluster, and long video encoding jobs sent to another grouping of nodes in the cluster.
The idea behind this is to help maintain fairness amongst clients by partitioning the large jobs into a separate pool of nodes. This helps ensure that the small video encoding jobs are not blocked / throttled by a single tenant running a long encoding job.
Using Service Fabric
We plan on using an ASF service for the video encoding. With this in mind we had an idea of dynamically creating a service for each job that comes in. Placement constraints could then be used to determine which pool of nodes a job would run in. Custom metrics based on memory usage, CPU usage ... could be used to limit the number of active jobs on a node.
With this method the node distributing the jobs would have to poll whether a new service could currently be created that satisfies the placement constraints and metrics.
What happens when a service can't be placed on a node? (Using CreateServiceAsync I assume?)
Will this polling be prohibitively expensive?
Our video encoding executable is packaged along with the service which is approximately 80MB. Will this make the spinning up of a new service take a long time? (Minutes vs seconds)
As an alternative to this we could use a reliable queue based system, where the large jobs pool pulls from one queue and the small jobs pool pulls from another queue. This seems like the simpler way, but I want to explore all options to make sure I'm not missing out on some of the features of Service Fabric. Is there another better way you would suggest?
I have no experience with placement constraints and dynamic services, so I can't speak to that.
Polling the perf counters isn't terribly expensive, but it's not a free operation either. A one-second poll interval shouldn't cause any huge perf impact while still providing a decent degree of resolution.
The service packages get copied to each node at deployment time rather than when services get spun up, so it'll make the deployment a bit slower but not affect service creation.
You're going to want to put the job data in reliable collections any way you structure it, but the question is how. One idea I just had that might be worth considering is making the job processing service a partitioned service and basing your partitioning strategy on encoding job size and/or tenant, so that large jobs from the same tenant get stuck in the same queue and smaller jobs for others go elsewhere.
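The partitioning idea is language-neutral, so here is a rough Python sketch of just the routing decision (a real Service Fabric service would typically be written in .NET). The size cutoff, the partition counts and the name `partition_for` are all hypothetical values chosen for illustration.

```python
import zlib

# Hypothetical cutoff between "small" and "large" encoding jobs.
LARGE_JOB_BYTES = 500 * 1024 * 1024
# Hypothetical number of partitions in each pool.
SMALL_PARTITIONS = 4
LARGE_PARTITIONS = 2


def partition_for(tenant_id, job_size_bytes):
    """Return (pool, partition index); a tenant's jobs of one class share a partition."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    digest = zlib.crc32(tenant_id.encode("utf-8"))
    if job_size_bytes >= LARGE_JOB_BYTES:
        return ("large", digest % LARGE_PARTITIONS)
    return ("small", digest % SMALL_PARTITIONS)
```

Because the partition is derived from the tenant, one tenant's large jobs queue behind each other rather than in front of other tenants' small jobs.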
As an aside, one thing I've dealt with in the past is that SF remoting limits the size of the messages sent and throws if a message is too big, so if your video files are being passed from service to service, you're going to want to consider a paging strategy for inter-service communication.
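A paging strategy boils down to cutting the payload into numbered chunks below the message limit and reassembling them on the other side. The Python sketch below shows only that logic; `iter_pages` and `reassemble` are hypothetical names, and the actual page size would have to be chosen to fit under whatever limit your remoting configuration enforces.

```python
def iter_pages(payload, page_size):
    """Yield (page_index, total_pages, chunk) tuples covering payload in order."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    # Ceiling division; an empty payload still produces one (empty) page.
    total = max(1, -(-len(payload) // page_size))
    for index in range(total):
        start = index * page_size
        yield index, total, payload[start:start + page_size]


def reassemble(pages):
    """Rebuild the payload from pages that may have arrived out of order."""
    ordered = sorted(pages, key=lambda page: page[0])
    return b"".join(chunk for _, _, chunk in ordered)
```

Sending the total page count with every page lets the receiver know when it has everything without a separate "end" message.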

One-time jobs on Azure workers

What is the best way to do a one-time job on Azure?
Say we want to extend a table in the associated database with a double column. All the new entries will have this value computed by the worker(s) at insertion, but somebody has to take care of the entries that are already in the table. I thought of two alternatives:
a method called by the worker only if a database entry (say, "JobRun") is set to true, and the method would flip the entry to false.
a separate app that does the job, and which is downloaded and run manually using remote desktop (I cannot connect the local app to the Azure SQL server).
The first alternative is messy (how should I deal with the code at the next deployment? delete it? comment it? leave it there? also, what if I will have another job in the future? create a new database entry "Job2Run"?). The second one looks like a cheap hack. I am sure that there is a better way I could not think of.
If you want to run a job once you'll need to take into account the following:
Concurrency: While the job is running, make sure no other worker picks up the job and runs it at the same time (you can use leases for this. More info here).
Once the job is done, you'll need to keep track (in Table Storage, SQL Azure, ...) that the job completed successfully. The next time a worker tries to pick up the job, it will look in Table Storage / SQL Azure / ..., it will see that the job completed and skip the job.
Failure: Maybe your worker crashes during the job which should allow another worker to pick up the job without any issue.
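The three points above (lease, completion marker, recovery after a crash) can be sketched with a single tracking table. This is a minimal illustration that uses SQLite as a stand-in for Table Storage or SQL Azure; the table and the names `init`, `try_claim` and `mark_done` are hypothetical.

```python
import sqlite3
import time


def init(conn):
    """Create the (hypothetical) tracking table if it does not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS one_time_jobs ("
        "name TEXT PRIMARY KEY, status TEXT NOT NULL, "
        "owner TEXT, lease_expires REAL)")


def try_claim(conn, name, worker, lease_seconds=300, now=None):
    """Return True if this worker now holds the lease for an unfinished job."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO one_time_jobs (name, status) "
            "VALUES (?, 'pending')", (name,))
        # Claim only if nobody has started it, or the previous lease expired
        # (which is what lets another worker take over after a crash).
        cur = conn.execute(
            "UPDATE one_time_jobs SET status = 'running', owner = ?, "
            "lease_expires = ? WHERE name = ? AND (status = 'pending' "
            "OR (status = 'running' AND lease_expires < ?))",
            (worker, now + lease_seconds, name, now))
        return cur.rowcount == 1


def mark_done(conn, name, worker):
    """Record completion so that no worker picks the job up again."""
    with conn:
        conn.execute(
            "UPDATE one_time_jobs SET status = 'done', lease_expires = NULL "
            "WHERE name = ? AND owner = ?", (name, worker))
```

A worker calls `try_claim` at startup, runs the job only if it returns True, and calls `mark_done` afterwards. A long-running job would also need to renew its lease periodically, which is not shown here.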
In your specific use case I would also consider using a tool like dbup to manage updates to your schema and existing data with SQL Scripts. Tools like these keep track of which scripts have been executed by adding them in a table in the database.
