Azure HDInsight script action

I am trying to copy a file from an accessible Data Lake Store to blob storage while spinning up the cluster.
I am using this command from the Azure documentation:
hadoop distcp adl://data_lake_store_account.azuredatalakestore.net:443/myfolder wasb://container_name@storage_account_name.blob.core.windows.net/example/data/gutenberg
Now, if I try to automate this instead of hardcoding it, how do I use this in a script action? To be specific, how can I dynamically get the container name and storage_account_name associated with the cluster while spinning it up?

First, as the documentation describes:
A Script Action is simply a Bash script that you provide a URI to, and parameters for. The script runs on nodes in the HDInsight cluster.
So you just need to refer to the official tutorial Script action development with HDInsight to write your script action and learn how to run it. Or you can call the REST API Run Script Actions on a running cluster (Linux cluster only) to run it automatically.
As for how to dynamically get the container name & storage account name, one way that works from any language is to call the REST API Get configurations and extract the property you want from the core-site configuration in the JSON response, or to call the Get configuration REST API with core-site as the {configuration Type} parameter in the URL and extract the property you want from the JSON response.
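For example, here is a minimal Python sketch that pulls core-site from the cluster's Ambari REST API (one concrete way of making such a "Get configurations" call; the cluster name, admin credentials, and the assumption that fs.defaultFS points at the default WASB container are placeholders/assumptions):
import requests

# Sketch only: CLUSTERNAME and the cluster login password are placeholders.
AMBARI = "https://CLUSTERNAME.azurehdinsight.net/api/v1/clusters/CLUSTERNAME"
AUTH = ("admin", "CLUSTER_LOGIN_PASSWORD")

# 1. Find the current tag of the core-site configuration
desired = requests.get(AMBARI, params={"fields": "Clusters/desired_configs"}, auth=AUTH).json()
tag = desired["Clusters"]["desired_configs"]["core-site"]["tag"]

# 2. Fetch the core-site properties for that tag
conf = requests.get(f"{AMBARI}/configurations", params={"type": "core-site", "tag": tag}, auth=AUTH).json()
fs_default = conf["items"][0]["properties"]["fs.defaultFS"]

# 3. fs.defaultFS looks like wasb://<container>@<account>.blob.core.windows.net
container, account_host = fs_default.replace("wasb://", "").split("@", 1)
storage_account = account_host.split(".")[0]
print(container, storage_account)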
Hope it helps.

Related

Run ADX script on cluster scope with bicep

I use Azure DevOps pipelines. You can run a script at the database level with Bicep; that is clearly covered in the documentation. But I want to run a script at the cluster level to update the workload_group policy and increase the allowed number of concurrent queries. However, when running the query as part of the Bicep deployment (on the database script property) to alter this, it results in the following error:
Reason: Not a database-scope command
How can I run this query (which should indeed be run at the cluster level) as part of the Bicep deployment? I use the following query, which does work when run in the query window in the Azure portal.
.create-or-alter workload_group ['default'] ```
<<workgroupConfig>>
```.
I also know there are Azure DevOps tasks for running scripts against the database, but I would not like to use those since Data Explorer is in a private network and not publicly accessible.

Extracting Spark logs (Spark UI contents) from Databricks

I am trying to save the Apache Spark logs (the contents of the Spark UI), not necessarily the stderr, stdout and log4j files (although they might be useful too), to a file so that I can send it over to someone else to analyze.
I am following the manual described in the Apache Spark documentation here:
https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
The problem is that I am running the code on Azure Databricks. Databricks saves the logs elsewhere; you can display them from the web UI but cannot export them.
When I ran the Spark job with spark.eventLog.dir set to a location in DBFS, the file was created but it was empty.
Is there a way to export the full Databricks job log so that anyone can open it without giving them the access to the workspace?
The simplest way of doing it is the following:
Create a separate storage account with a container in it, or a separate container in an existing storage account, and give developers access to it.
Mount that container to the Databricks workspace.
Configure clusters/jobs to write logs into the mount location (you can enforce it for new objects using cluster policies); see the sketch after these steps. This will create sub-directories named after the cluster, containing the driver & executor logs plus the output of the init scripts.
(optional) Set up a retention policy on that container to automatically remove old logs.
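As an illustration of the third step, here is a minimal sketch (not an official template) that points cluster log delivery at such a mount via cluster_log_conf when creating a cluster through the Clusters API; the workspace URL, token, runtime, node type and the dbfs:/mnt/cluster-logs mount path are placeholders:
import requests

HOST = "https://<databricks-instance>"   # placeholder workspace URL
TOKEN = "<personal-access-token>"        # placeholder token

cluster_spec = {
    "cluster_name": "logged-cluster",
    "spark_version": "11.3.x-scala2.12",   # example runtime
    "node_type_id": "Standard_DS3_v2",     # example node type
    "num_workers": 2,
    # deliver driver/executor logs into the mounted container
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/mnt/cluster-logs"}},
}

resp = requests.post(f"{HOST}/api/2.0/clusters/create",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=cluster_spec)
print(resp.json())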

Getting job owner from Databricks CLI

I'm trying to obtain the owner of a list of jobs on Databricks using the CLI. The issue is that the command databricks jobs list doesn't return any information related to that. Any suggestions?
Thanks in advance!
You need to use the Permissions API for that, specifically the Get Job Permissions call. Just do a GET request against the endpoint https://<databricks-instance>/api/2.0/preview/permissions/jobs/{job_id} - replace {job_id} with the actual job ID - and look for the entry that holds the IS_OWNER permission level in the response.
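A minimal Python sketch of that call (workspace URL, token and job ID are placeholders); the owner is the entry in the access control list that holds the IS_OWNER permission level:
import requests

HOST = "https://<databricks-instance>"   # placeholder
TOKEN = "<personal-access-token>"        # placeholder
JOB_ID = 12345                           # placeholder job ID

resp = requests.get(f"{HOST}/api/2.0/preview/permissions/jobs/{JOB_ID}",
                    headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()

# print the principal that holds the IS_OWNER permission level
for entry in resp.json().get("access_control_list", []):
    if any(p.get("permission_level") == "IS_OWNER" for p in entry.get("all_permissions", [])):
        print(entry.get("user_name") or entry.get("service_principal_name"))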

Cloud Function in Python3 - copy from Google Cloud Bucket to another Google Cloud Bucket

Google has a Cloud Storage Data Transfer option to copy from one bucket to another, but this only works if both buckets are in the same project. Using gsutil -m rsync -r -d is an easy option to run as a cron job, but we are migrating all bash to Python 3. So I need a Python 3 script, to be used as a Google Cloud Function, that does a weekly copy of the whole bucket from project1 to another bucket in project2.
Language: python 3
app : Cloud Function
Process : Copy one bucket to another
Source Project: project1
Source bucket : bucket1
Dest Project: project2
Dest Bucket: bucket2
pseudo cmd: rsync -r gs://project1/bucket1 gs://project2/bucket2
Any quick and readable Python 3 script to do that?
A Python script to do this will be really slow.
I would use a Dataflow (Apache Beam) batch process to do this. You can code this in Python 3 easily.
Basically you need:
One operation to list all files.
One shuffle() operation to distribute the load among several workers.
One operation to actually copy from source to destination (sketched below).
The good part is that Google will scale the workers for you, and it won't take much time.
You'll be billed for the storage operations and the gigabytes + CPU it takes to move all the data.
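A rough sketch of such a Beam pipeline (the bucket names and pipeline options are placeholders, and this is just one way to wire it up, not a tested production job):
import apache_beam as beam
from apache_beam.io.gcp import gcsio
from apache_beam.options.pipeline_options import PipelineOptions

SRC_BUCKET = "bucket1"   # source bucket in project1 (placeholder)
DST_BUCKET = "bucket2"   # destination bucket in project2 (placeholder)

def list_files(prefix):
    # list every object under the source prefix
    return list(gcsio.GcsIO().list_prefix(prefix).keys())

def copy_file(src_path):
    # server-side copy of a single object into the destination bucket
    dst_path = src_path.replace(f"gs://{SRC_BUCKET}/", f"gs://{DST_BUCKET}/", 1)
    gcsio.GcsIO().copy(src_path, dst_path)

options = PipelineOptions()  # add --runner=DataflowRunner, --project, --region, etc.
with beam.Pipeline(options=options) as p:
    (p
     | "Seed" >> beam.Create([f"gs://{SRC_BUCKET}/"])
     | "List" >> beam.FlatMap(list_files)
     | "Shuffle" >> beam.Reshuffle()   # spread the copies across workers
     | "Copy" >> beam.Map(copy_file))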
Rsync is not an operation that can be performed via a single request to the Cloud Storage REST API, and gsutil is not available in Cloud Functions, so rsyncing the buckets directly from a Python Cloud Function is not possible.
You can create a function that starts a preemptible VM with a startup script that executes the rsync between the buckets and shuts down the instance after the rsync operation finishes.
By using a VM instead of a serverless service you avoid any timeout that could be caused by a long rsync process.
A preemptible VM can run for up to 24 hours before being stopped, and you are only charged for the time the instance is turned on (the disk storage is charged independently of the instance's status).
If the VM is powered off within the first minute, you won't be charged for the usage.
For this approach, you first need to create a bash script in a bucket; it will be executed by the preemptible VM at startup time. For example:
#! /bin/bash
# sync everything from the source bucket to the destination bucket
gsutil rsync -r gs://mybucket1 gs://mybucket2
sudo init 0 # this is similar to poweroff, halt or shutdown -h now
After that, you need to create a preemptible VM with a startup script. I recommend an f1-micro instance, since the rsync command between buckets doesn't require many resources.
1. Go to the VM instances page.
2. Click Create instance.
3. On the Create a new instance page, fill in the properties for your instance.
4. Click Management, security, disks, networking, sole tenancy.
5. In the Identity and API access section, select a service account that has access to read your startup script file in Cloud Storage and to the buckets to be synced.
6. Select Allow full access to all Cloud APIs.
7. Under Availability policy, set the Preemptibility option to On. This setting disables automatic restart for the instance, and sets the host maintenance action to Terminate.
8. In the Metadata section, provide startup-script-url as the metadata key.
9. In the Value box, provide a URL to the startup script file, either in gs://BUCKET/FILE or https://storage.googleapis.com/BUCKET/FILE format.
10. Click Create to create the instance.
With this configuration, every time your instance starts, the script will be executed as well.
This is the Python function to start a VM (regardless of whether it is preemptible):
def power(request):
    import logging
    # these libraries are needed to reach the Compute Engine API
    from googleapiclient import discovery
    from oauth2client.client import GoogleCredentials
    # the function will use the service account of your Cloud Function
    credentials = GoogleCredentials.get_application_default()
    # this line specifies the API we are going to use, in this case Compute Engine
    service = discovery.build('compute', 'v1', credentials=credentials, cache_discovery=False)
    # set the correct log level (to avoid noise in the logs)
    logging.getLogger('googleapiclient.discovery_cache').setLevel(logging.ERROR)
    # Project ID for this request.
    project = "yourprojectID"  # Update placeholder value.
    zone = "us-central1-a"     # update this to the zone of your VM
    instance = "myvm"          # update with the name of your VM
    response = service.instances().start(project=project, zone=zone, instance=instance).execute()
    print(response)
    return ("OK")
requirements.txt file
google-api-python-client
oauth2client
flask
And you can schedule your function with Cloud Scheduler:
Create a service account with the functions.invoker permission on your function.
Create a new Cloud Scheduler job.
Specify the frequency in cron format.
Specify HTTP as the target type.
Add the URL and method of your Cloud Function as usual.
Select the OIDC token from the Auth header dropdown.
Add the service account email in the Service account text box.
In the Audience field you only need to write the URL of the function, without any additional parameters.
In Cloud Scheduler, I hit my function using this URL
https://us-central1-yourprojectID.cloudfunctions.net/power
and I used this audience
https://us-central1-yourprojectID.cloudfunctions.net/power
Please replace yourprojectID in the code and in the URLs, and the zone us-central1, with your own values.

How to ship Airflow logs to Azure Blob Store

I'm having trouble following section 3.6.5.3, "Writing Logs to Azure Blob Storage", of this guide.
The documentation states you need an active hook to Azure Blob Storage. I'm not sure how to create this. Some sources say you need to create the hook in the UI, and some say you can use an environment variable. Either way, none of my logs are getting written to blob storage and I'm at my wits' end.
An Azure Blob Storage hook (or any hook, for that matter) tells Airflow how to write into Azure Blob Storage. It is already included in recent versions of Airflow as wasb_hook.
You will need to make sure that the hook is able to write to Azure Blob Storage. Also note that the REMOTE_BASE_LOG_FOLDER should start with wasb (e.g. wasb-xxx) so that the WASB log handler is picked up. Once you take care of these two things, the instructions work without a hitch.
I achieved writing logs to blob storage using the steps below:
Create a folder named config inside the airflow folder.
Create empty __init__.py and log_config.py files inside the config folder.
Search for airflow_local_settings.py on your machine:
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.pyc
Run:
cp /home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py config/log_config.py
Edit the [core] section of airflow.cfg:
remote_logging = True
remote_log_conn_id = log_sync
remote_base_log_folder = wasb://airflow-logs@storage-account.blob.core.windows.net/logs/
logging_config_class = log_config.DEFAULT_LOGGING_CONFIG
Add the log_sync connection object for your storage account (see the sketch after these steps).
Install the Airflow Azure dependency:
pip install apache-airflow[azure]
Restart the webserver and scheduler.
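For the log_sync connection, here is a minimal sketch of creating it programmatically (it assumes, as wasb_hook does, that the storage account name goes into login and the account key into password; the values are placeholders, and you can equally create the connection in the Airflow UI):
from airflow import settings
from airflow.models import Connection

# wasb connection referenced by remote_log_conn_id = log_sync in airflow.cfg
conn = Connection(
    conn_id="log_sync",
    conn_type="wasb",
    login="storage-account",          # storage account name (placeholder)
    password="storage-account-key",   # storage account access key (placeholder)
)

session = settings.Session()
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()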
