The script install_hdisamples.py is executed at HDInsight cluster creation. It creates some folders with file samples and some tables in the referenced storage.
Is it possible to avoid the deployment of such script at cluster creation? Can it be done in the ARM template?
I don't think it's currently possible to avoid running the script. That said, you could create a script action that undoes what the script does. This is a flaky solution, and I acknowledge it's not perfect.
I am more interested in understanding the motivation for wanting to stop the script from running. Would you mind elaborating?
Related
I have a pyspark batch job scheduled on YARN. There is now a requirement to put the logic of the spark job into a web service.
I really don't want there to be 2 copies of the same code, and therefore would like to somehow reuse the spark code inside the service, only replacing the IO parts.
The expected size of the workloads per request is small so I don't want to complicate the service by turning it into a distributed application. I would like instead to run the spark code in local mode inside the service. How do I do that? Is that even a good idea? Are there better alternatives?
We need to execute a long running exe running on a windows machine and thinking of ways to integrate with the workflow. The plan is to include the exe as a task in the Databricks workflow.
We are thinking of couple of approaches
Create a DB table and enter a row when this particular task gets started in the workflow. Exe which is running on a windows machine will ping the database table for any new records. Once a new record is found, the exe proceeds with actual execution and updates the status after completion. Databricks will query this table constantly for the status and once completed, task finishes.
Using databricks API, check whether the task has started execution in the exe and continue with execution. After application finishes, update the task status to completion until then the Databricks task will run like while (true). But the current API doesn't support updating the task execution status (To Complete) (not 100% sure).
Please share thoughts OR alternate solutions.
This is an interesting problem. Is there a reason you must use Databricks to execute an EXE?
Regardless, I think you have the right kind of idea. How I would do this with the jobs api is as described:
Have your EXE process output a file to a staging location probably in DBFS since this will be locally accessible inside of databricks.
Build a notebook to load this file, having a table is optional but may give you addtional logging capabilities if needed. The output of your notebook should use the dbutils.notebook.exit method which allows you to output any value string or array. You could return "In Progress" and "Success" or the latest line from your file you've written.
Wrap that notebook in a databricks job and execute on an interval with a cron schedule (you said 1 minute) and you can retrieve the output value of your job via the get-output endpoint
Additional Note, the benefit of abstracting this into return values from a notebook is you could orchestrate this via other workflow tools e.g. Databricks Workflows or Azure Data Factory with inside an Until condition. There are no limits so long as you can orchestrate a notebook in that tool.
A peer of mine has created code that opens a restful api web service within an interactive spark job. The intent of our company is to use his code as a means of extracting data from various datasources. He can get it to work on his machine with a local instance of spark. He insists that this is a good idea and it is my job as DevOps to implement it with Azure Databricks.
As I understand it interactive jobs are for one-time analytics inquiries and for the development of non-interactive jobs to be run solely as ETL/ELT work between data sources. There is of course the added problem of determining the endpoint for the service binding within the spark cluster.
But I'm new to spark and I have scarcely delved into the mountain of documentation that exists for all the implementations of spark. Is what he's trying to do a good idea? Is it even possible?
The web-service would need to act as a Spark Driver. Just like you'd run spark-shell, run some commands , and then use collect() methods to bring all data to be shown in the local environment, that all runs in a singular JVM environment. It would submit executors to a remote Spark cluster, then bring the data back over the network. Apache Livy is one existing implementation for a REST Spark submission server.
It can be done, but depending on the process, it would be very asynchronous, and it is not suggested for large datasets, which Spark is meant for. Depending on the data that you need (e.g. highly using SparkSQL), it'd be better to query a database directly.
I'm trying to install Giraph on HDInsight cluster with hadoop, using script actions.
After 30+- minutes when deploying the cluster, an error shows up.
Deployment failed
Deployment to resource group 'graphs' failed. Additional details from
the underlying API that might be helpful: At least one resource
deployment operation failed. Please list deployment operations for
details. Please see https://aka.ms/arm-debug for usage details.
Thanks in advance.
Thanks a lot for reporting this issue. We found what the issue is and fixed it.
Issue: There’s a deadlock when the Giraph script is provided during cluster creation. The Giraph script waits for /example/jars to be created in DFS (Wasb/ADLS), but /example/jars can only be created after the Giraph script completes. This issue doesn’t repro for runtime scripts since at the point the script is run, /example/jars has already existed.
Note: We have created and deployed the fix for the scripts. And I have also tested creating a cluster with the updated version, which works fine. Please test on your side and let me know.
I have created an Azure HDInsight cluster using PowerShell. Now I need to install some custom software on the worker nodes that is required for the mappers I will be running using Hadoop streaming. I haven't found any PowerShell command that could help me with this task. I can prepare a custom job that will setup all the workers, but I'm not convinced that this is the best solution. Are there better options?
edit:
With AWS Elastic MapReduce there is an option to install additional software in a bootstrap action that is defined when you create a cluster. I was looking for something similar.
You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data.
from: Create Bootstrap Actions to Install Additional Software
The short answer is that you don't. It's not ideal from a caching perspective, but you ought to be able to bundle all your job dependencies into the map reduce jar which is distributed across the cluster for you by YARN (part of Hadoop). This is broadly speaking transparent to the end user, as it's all handled through the job submission process.
If you need something large which is a shared dependency across many jobs, and you don't want it copied out every time, you can keep it on wasb:// storage, and reference that in a class path, but that might cause you complexity if you are for instance using the .NET Streaming API.
I've just heard from a collage that I need to update my Azure PS because recently a new Cmdlet Add-AzureHDInsightScriptAction was added and it does just that.
Customize HDInsight clusters using Script Action