Databricks: java.io.IOException: Failed to perform getMountFileState(forceRefresh=true)

In Databricks I tried this:
storageName = "..."
accessKey = "..."
containerName = '...'
#try:
dbutils.fs.mount(
    source = "wasbs://"+containerName+"@"+storageName+".blob.core.windows.net/",
    mount_point = "/mnt/foldername",
    extra_configs = {"fs.azure.account.key."+storageName+".blob.core.windows.net": accessKey})
#except:
#    print("hello")
#    pass
And I had this error:
java.io.IOException: Failed to perform 'getMountFileState(forceRefresh=true)' for mounts after 3 attempts. Please, retry the operation.
Original exception: 'shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This request is not authorized to perform this operation.
What can I do to avoid this?
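The inner StorageException ("This request is not authorized to perform this operation") typically comes from the storage account side, for example a wrong access key or firewall/virtual-network rules blocking the Databricks workspace, so that is worth checking in Azure first. A leftover mount point from an earlier failed attempt can also make a retry fail. Below is a minimal sketch, assuming the same storageName, accessKey and containerName variables as above and the illustrative mount point /mnt/foldername, that only mounts when the mount point is not already registered:

mount_point = "/mnt/foldername"

# Mount only if this mount point is not already registered on the workspace.
if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        # note the '@' between the container name and the storage account
        source="wasbs://" + containerName + "@" + storageName + ".blob.core.windows.net/",
        mount_point=mount_point,
        extra_configs={"fs.azure.account.key." + storageName + ".blob.core.windows.net": accessKey},
    )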

Related

How to format the file path in an MLTable for Azure Machine Learning uploaded during a pipeline job?

How is the path to a (.csv) file to be expressed in an MLTable file that is created in a local folder but then uploaded as part of a pipeline job?
I'm following the Jupyter notebook automl-forecasting-task-energy-demand-advance from the azureml-examples repo (article and notebook). This example has an MLTable file, shown below, referencing a .csv file with a relative path. Then in the pipeline the MLTable is uploaded to be accessible to a remote compute (a few things are omitted for brevity):
my_training_data_input = Input(
    type=AssetTypes.MLTABLE, path="./data/training-mltable-folder"
)

compute = AmlCompute(
    name=compute_name, size="STANDARD_D2_V2", min_instances=0, max_instances=4
)

forecasting_job = automl.forecasting(
    compute=compute_name,  # name of the compute target we created above
    # name="dpv2-forecasting-job-02",
    experiment_name=exp_name,
    training_data=my_training_data_input,
    # validation_data = my_validation_data_input,
    target_column_name="demand",
    primary_metric="NormalizedRootMeanSquaredError",
    n_cross_validations="auto",
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"},
)

returned_job = ml_client.jobs.create_or_update(forecasting_job)
ml_client.jobs.stream(returned_job.name)
But running this gives the error below.
Error message:
Encountered user error while fetching data from Dataset. Error: UserErrorException:
Message: MLTable yaml schema is invalid:
Error Code: Validation
Validation Error Code: Invalid MLTable
Validation Target: MLTableToDataflow
Error Message: Failed to convert a MLTable to dataflow
uri path is not a valid datastore uri path
| session_id=857bd9a1-097b-4df6-aa1c-8871f89580d8
InnerException None
ErrorResponse
{
  "error": {
    "code": "UserError",
    "message": "MLTable yaml schema is invalid: \nError Code: Validation\nValidation Error Code: Invalid MLTable\nValidation Target: MLTableToDataflow\nError Message: Failed to convert a MLTable to dataflow\nuri path is not a valid datastore uri path\n| session_id=857bd9a1-097b-4df6-aa1c-8871f89580d8"
  }
}
The MLTable file referenced above is:
paths:
  - file: ./nyc_energy_training_clean.csv
transformations:
  - read_delimited:
      delimiter: ','
      encoding: 'ascii'
  - convert_column_types:
      - columns: demand
        column_type: float
      - columns: precip
        column_type: float
      - columns: temp
        column_type: float
How am I supposed to run this? Thanks in advance!
For the remote path you can use the approach below; here is the documentation for creating data assets.
It's important to note that the path specified in the MLTable file must be a valid path in the cloud, not just a valid path on your local machine.
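As a rough sketch of what "a valid path in the cloud" can look like: one option is to register the local MLTable folder as a data asset and then reference the registered asset in the Input instead of the local relative path. The asset name energy-training-data below is illustrative, not part of the original example, and ml_client is the MLClient already used in the question:

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data

# Register the local MLTable folder as a data asset; the SDK uploads the
# folder (MLTable file plus the .csv it references) to the workspace's
# default datastore, so the relative path inside the MLTable travels with it.
training_data_asset = Data(
    name="energy-training-data",             # illustrative asset name
    version="1",
    type=AssetTypes.MLTABLE,
    path="./data/training-mltable-folder",   # local folder containing the MLTable
)
ml_client.data.create_or_update(training_data_asset)

# Reference the registered asset (azureml:<name>:<version>) rather than a local path.
my_training_data_input = Input(
    type=AssetTypes.MLTABLE,
    path="azureml:energy-training-data:1",
)

A datastore URI of the form azureml://datastores/<datastore_name>/paths/<folder> should also satisfy the "valid datastore uri path" check, provided the MLTable folder has actually been uploaded to that datastore.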

Unable to download terraform modules from azure repo (Private repo)

My terraform-modules repo location is like this:
https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster
I have three directories/modules at root level, namely compute, resourcegroup and sqlserver.
However, when I run terraform init, Terraform is unable to download the required modules.
main.tf
module "app_vms" {
  source             = "https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster"
  rg_name            = var.resource_group_name
  location           = module.resource_group.external_rg_location
  vnet_name          = var.virtual_network_name
  subnet_name        = var.sql_subnet_name
  app_nsg            = var.application_nsg
  vm_count           = var.count_vm
  base_hostname      = var.app_host_basename
  sto_acc_suffix     = var.storage_account_suffix
  vm_size            = var.virtual_machine_size
  vm_publisher       = var.virtual_machine_image_publisher
  vm_offer           = var.virtual_machine_image_offer
  vm_sku             = var.virtual_machine_image_sku
  vm_img_version     = var.virtual_machine_image_version
  username           = var.username
  password           = var.password
  allowed_source_ips = var.ip_list
}

module "resource_group" {
  source  = "https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fresourcegroup&version=GBmaster"
  rg_name = "test_rg"
}

module "azure_paas_sqlserver" {
  source = "https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fsqlserver&version=GBmaster"
}
It gives me a series of errors like the ones below (a sample only, not all of the errors, since they are all the same):
Error: Failed to download module
Could not download module "sql_vms" (main.tf:1) source code from
"https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster":
error downloading
'https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster':
no source URL was returned
Error: Failed to download module
Could not download module "sql_vms" (main.tf:1) source code from
"https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster":
error downloading
'https://teamabc.visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster':
no source URL was returned
I tried to remove the https:// part, but no luck. The repo does require a username and password to log in.
I'm wondering if I should make a public repo on GitHub, but pushes within the organization are supposed to go to Azure Repos.
Update after the first comment
Thanks for the lead, I did try, but still no luck.
My source URL now looks like this:
source = "git::https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster"
I get error below:
Error: Failed to download module
Could not download module "sql_vms" (main.tf:1) source code from
"git::https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster":
error downloading
'https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster':
/usr/bin/git exited with 128: Cloning into '.terraform/modules/sql_vms'...
fatal: repository
'https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster/'
not found
Here:
teamabc.visualstudio.com is the parent Azure DevOps URL
dummyproject is the project name
Update after Charles' response
Error: Failed to download module
Could not download module "sql_vms" (main.tf:1) source code from
"git::https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster.git":
error downloading
'https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster.git':
/usr/bin/git exited with 128: Cloning into '.terraform/modules/sql_vms'...
fatal: repository
'https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster.git/'
not found
You can take a look at the Generic Git Repository documentation; the URL should be a Git URL. Finally, it should look like this:
source = "git::https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster.git"
Or you can select a branch from your Git Repository like this:
source = "git::https://teamabc:lfithww4xpp4eksvoimgzkpi3ugu6xvrkf26mfq3jth3642jgyoa@visualstudio.com/dummyproject/_git/terraform-modules?path=%2Fcompute&version=GBmaster.git?ref=<branch>"
Finally, I got it working with a source URL of the form below:
git::https://<PAT TOKEN>@<Azure DevOps URL>/DefaultCollection/<PROJECT NAME>/_git/<REPO NAME>//<sub directory>

Terraform CLI : Error: Failed to read ssh private key: no key found

I have this variable private_key_path = "/users/arun/aws_keys/pk.pem" defined in the terraform.tfvars file,
and I am doing SSH in my Terraform template. See the configuration below:
connection {
  type        = "ssh"
  host        = self.public_ip
  user        = "ec2-user"
  private_key = file(var.private_key_path)
}
The private key file is very much available at that path, but I still get the below error thrown by the Terraform CLI:
Error: Failed to read ssh private key: no key found
Is there anything else I am missing?
Generate the public and private key using Git Bash:
$ ssh-keygen.exe -f demo
Then point the configuration at the demo file, or copy the demo and demo.pub files to the specific directory.

Difficulties in using a Gcloud Composer DAG to run a Spark job

I'm playing around with Gcloud Composer, trying to create a DAG that creates a DataProc cluster, runs a simple Spark job, then tears down the cluster. I am trying to run the Spark PI example job.
I understand that when calling DataProcSparkOperator I can choose only to define either the main_jar or the main_class property. When I define main_class, the job fails with the error:
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:239)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
When I choose to define the main_jar property, the job fails with the error:
Error: No main class set in JAR; please specify one with --class
Run with --help for usage help or --verbose for debug output
I'm at a bit of a loss as to how to resolve this, as I am kinda new to both Spark and DataProc.
My DAG:
import datetime as dt
from airflow import DAG, models
from airflow.contrib.operators import dataproc_operator as dpo
from airflow.utils import trigger_rule

MAIN_JAR = 'file:///usr/lib/spark/examples/jars/spark-examples.jar'
MAIN_CLASS = 'org.apache.spark.examples.SparkPi'
CLUSTER_NAME = 'quickspark-cluster-{{ ds_nodash }}'

yesterday = dt.datetime.combine(
    dt.datetime.today() - dt.timedelta(1),
    dt.datetime.min.time())

default_dag_args = {
    'start_date': yesterday,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': dt.timedelta(seconds=30),
    'project_id': models.Variable.get('gcp_project')
}

with DAG('dataproc_spark_submit', schedule_interval='0 17 * * *',
         default_args=default_dag_args) as dag:

    create_dataproc_cluster = dpo.DataprocClusterCreateOperator(
        project_id = default_dag_args['project_id'],
        task_id = 'create_dataproc_cluster',
        cluster_name = CLUSTER_NAME,
        num_workers = 2,
        zone = models.Variable.get('gce_zone')
    )

    run_spark_job = dpo.DataProcSparkOperator(
        task_id = 'run_spark_job',
        #main_jar = MAIN_JAR,
        main_class = MAIN_CLASS,
        cluster_name = CLUSTER_NAME
    )

    delete_dataproc_cluster = dpo.DataprocClusterDeleteOperator(
        project_id = default_dag_args['project_id'],
        task_id = 'delete_dataproc_cluster',
        cluster_name = CLUSTER_NAME,
        trigger_rule = trigger_rule.TriggerRule.ALL_DONE
    )

    create_dataproc_cluster >> run_spark_job >> delete_dataproc_cluster
I compared it with a successful job submitted using the CLI and saw that, even when the class was populating the Main class or jar field, the path to the jar was still specified in the Jar files field.
Checking the operator, I noticed there is also a dataproc_spark_jars parameter, which is not mutually exclusive with main_class:
run_spark_job = dpo.DataProcSparkOperator(
    task_id = 'run_spark_job',
    dataproc_spark_jars = [MAIN_JAR],
    main_class = MAIN_CLASS,
    cluster_name = CLUSTER_NAME
)
Adding it did the trick.

Error while submitting a spark job using spark-jobserver

I occasionally face the following error while submitting a job. The error goes away if I remove the rootdir of filedao, datadao and sqldao, which means I have to restart the job server and re-upload my jar.
{
  "status": "ERROR",
  "result": {
    "message": "Ask timed out on [Actor[akka://JobServer/user/context-supervisor/1995aeba-com.spmsoftware.distributed.job.TestJob#-1370794810]] after [10000 ms]. Sender[null] sent message of type \"spark.jobserver.JobManagerActor$StartJob\".",
    "errorClass": "akka.pattern.AskTimeoutException",
    "stack": ["akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)", "akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)", "scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)", "scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)", "scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)", "akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)", "akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)", "akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)", "akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)", "java.lang.Thread.run(Thread.java:745)"]
  }
}
My config file is as follows:
# Template for a Spark Job Server configuration file
# When deployed these settings are loaded when job server starts
#
# Spark Cluster / Job Server configuration
spark {
  # spark.master will be passed to each job's JobContext
  master = <spark_master>

  # Default # of CPUs for jobs to use for Spark standalone cluster
  job-number-cpus = 4

  jobserver {
    port = 8090
    context-per-jvm = false
    context-creation-timeout = 100 s
    # Note: JobFileDAO is deprecated from v0.7.0 because of issues in
    # production and will be removed in future, now defaults to H2 file.
    jobdao = spark.jobserver.io.JobSqlDAO
    filedao {
      rootdir = /tmp/spark-jobserver/filedao/data
    }
    datadao {
      rootdir = /tmp/spark-jobserver/upload
    }
    sqldao {
      slick-driver = slick.driver.H2Driver
      jdbc-driver = org.h2.Driver
      rootdir = /tmp/spark-jobserver/sqldao/data
      jdbc {
        url = "jdbc:h2:file:/tmp/spark-jobserver/sqldao/data/h2-db"
        user = ""
        password = ""
      }
      dbcp {
        enabled = false
        maxactive = 20
        maxidle = 10
        initialsize = 10
      }
    }
    result-chunk-size = 1m
    short-timeout = 60 s
  }

  context-settings {
    num-cpu-cores = 2       # Number of cores to allocate. Required.
    memory-per-node = 512m  # Executor memory per node, -Xmx style eg 512m, #1G, etc.
  }
}

akka {
  remote.netty.tcp {
    # This controls the maximum message size, including job results, that can be sent
    # maximum-frame-size = 200 MiB
  }
}

# check the reference.conf in spray-can/src/main/resources for all defined settings
spray.can.server.parsing.max-content-length = 250m
I am using spark-2.0-preview version.
I have faced the same error before, and it was related to a timeout. If it is a synchronous request (sync=true), you must also provide the timeout (in seconds), a value reflecting how long it takes to process your request.
This is an example of how the request should look:
curl -k --basic -d '' 'http://localhost:5050/jobs?appName=app&classPath=Main&context=test-context&sync=true&timeout=40'
If your request needs more than 40 seconds, you may also need to modify the application.conf located at
spark-jobserver-master/job-server/src/main/resources/application.conf
and in the spray.can.server section modify:
idle-timeout = 210 s
request-timeout = 200 s
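For completeness, the same synchronous submission can also be made from Python. This is a minimal sketch that mirrors the curl call above; the host, port, appName, classPath and context values are taken from that example and are not universal:

import requests

# Submit a job synchronously and wait up to 40 seconds for the result,
# mirroring the curl example above.
response = requests.post(
    "http://localhost:5050/jobs",
    params={
        "appName": "app",
        "classPath": "Main",
        "context": "test-context",
        "sync": "true",
        "timeout": "40",
    },
    data="",  # empty job configuration, as in the curl call
)
print(response.status_code)
print(response.json())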
