ScriptRunConfig with datastore reference on AML - azure-machine-learning-service

When trying to run a ScriptRunConfig, using:

src = ScriptRunConfig(source_directory=project_folder,
                      script='train.py',
                      arguments=['--input-data-dir', ds.as_mount(),
                                 '--reg', '0.99'],
                      run_config=run_config)
run = experiment.submit(config=src)
It doesn't work and breaks with this error when I submit the job:
... lots of things... and then
TypeError: Object of type 'DataReference' is not JSON serializable
However, if I run it with the Estimator, it works. One of the differences is that ScriptRunConfig takes a list for the script arguments, while the Estimator takes a dictionary.
Thanks for any pointers!

Being able to use DataReference in ScriptRunConfig is a bit more involved than doing just ds.as_mount(). You will need to convert it into a string in arguments and then update the RunConfiguration's data_references section with the DataReferenceConfiguration created from ds. Please see here for an example notebook on how to do that.
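A minimal sketch of that approach, reusing the names from the question (it assumes run_config is the RunConfiguration object you already pass in, and mirrors the pattern from the notebook):

data_ref = ds.as_mount()  # DataReference

src = ScriptRunConfig(source_directory=project_folder,
                      script='train.py',
                      arguments=['--input-data-dir', str(data_ref),  # pass the string form, not the object
                                 '--reg', '0.99'],
                      run_config=run_config)

# register the DataReference on the RunConfiguration so the service knows how to mount it
run_config.data_references = {data_ref.data_reference_name: data_ref.to_config()}

run = experiment.submit(config=src)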
If you are just reading from the input location and not doing any writes to it, please check out Dataset. It allows you to do exactly what you are doing without doing anything extra. Here is an example notebook that shows this in action.
Below is a short version of the notebook:

from azureml.core import Dataset, Datastore
# more imports and code

ds = Datastore(workspace, 'mydatastore')
dataset = Dataset.File.from_files(path=(ds, 'path/to/input-data/within-datastore'))

src = ScriptRunConfig(source_directory=project_folder,
                      script='train.py',
                      arguments=['--input-data-dir', dataset.as_named_input('input').as_mount(),
                                 '--reg', '0.99'],
                      run_config=run_config)
run = experiment.submit(config=src)
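Inside train.py, the mounted dataset then shows up as an ordinary directory path in the script arguments. A minimal sketch of the consuming side (argument names match the submission above):

# train.py (sketch): read the mounted input path passed on the command line
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--input-data-dir', type=str)
parser.add_argument('--reg', type=float)
args = parser.parse_args()

print(os.listdir(args.input_data_dir))  # lists the files from the datastore path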

You can find how-to-migrate-from-estimators-to-scriptrunconfig in the official documentation.
The core code for using a DataReference in a ScriptRunConfig is:
# if you want to pass a DataReference object, such as the below:
datastore = ws.get_default_datastore()
data_ref = datastore.path('./foo').as_mount()

src = ScriptRunConfig(source_directory='.',
                      script='train.py',
                      arguments=['--data-folder', str(data_ref)],  # cast the DataReference object to str
                      compute_target=compute_target,
                      environment=pytorch_env)

# set a dict of the DataReference(s) you want on the `data_references` attribute
# of the ScriptRunConfig's underlying RunConfiguration object
src.run_config.data_references = {data_ref.data_reference_name: data_ref.to_config()}

Related

How to pass RunProperties while calling the Glue workflow using boto3 and Python in a Lambda function?

My Python code in the Lambda function:

import json
import boto3
from botocore.exceptions import ClientError

glue_client = boto3.client('glue')
default_run_properties = {'s3_path': 's3://bucketname/abc.zip'}
response = glue_client.start_workflow_run(Name="Testing", RunProperties=default_run_properties)
print(response)
I am getting an error like this:
"errorMessage": "Parameter validation failed:\nUnknown parameter in input: \"RunProperties\", must be one of: Name",
"errorType": "ParamValidationError",
I also tried it like this:

session = boto3.session.Session()
glue_client = session.client('glue')

But got the same error.
Can anyone tell me how to pass the RunProperties when starting the Glue workflow run? The RunProperties are dynamic and need to be passed from the Lambda event.
I had the same issue and this is a bit tricky. I do not like my solution, so maybe someone else has a better idea? See here: https://github.com/boto/boto3/issues/2580
And also here: https://docs.aws.amazon.com/glue/latest/webapi/API_StartWorkflowRun.html
So, you cannot pass the parameters when starting the workflow, which is a shame in my opinion, because even the CLI suggests that: https://docs.aws.amazon.com/cli/latest/reference/glue/start-workflow-run.html
However, you can update the default run properties before you start the workflow. These values are then set for everyone, so if you expect any concurrency issues this is not a good way to go. You also need to decide whether to reset the values afterwards or just leave them for the next start of the workflow.
I start my workflows like this:
glue_client.update_workflow(
    Name=SHOPS_WORKFLOW_NAME,
    DefaultRunProperties={
        's3_key': file_key,
        'market_id': segments[0],
    },
)
workflow_run_id = glue_client.start_workflow_run(
    Name=SHOPS_WORKFLOW_NAME
)
This basically sets those values as the run properties picked up by the next run.
I had the same problem and asked in AWS re:Post. The problem is the old boto3 version used in Lambda. They recommended two ways to work around this issue:
Update the run properties of the workflow run immediately after start_workflow_run:
default_run_properties = {'s3_path': 's3://bucketname/abc.zip'}
response = glue_client.start_workflow_run(Name="Testing")
updateRun = glue_client.put_workflow_run_properties(
    Name="Testing",
    RunId=response['RunId'],
    RunProperties=default_run_properties
)
Or you can create a Lambda layer for your Lambda function and include a newer boto3 version there.
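With a recent boto3 available (for example via such a layer), the original call should be accepted as-is, since newer Glue clients support RunProperties on start_workflow_run; a sketch:

import boto3

glue_client = boto3.client('glue')  # resolves to the boto3 version provided by the layer
default_run_properties = {'s3_path': 's3://bucketname/abc.zip'}
response = glue_client.start_workflow_run(Name="Testing",
                                          RunProperties=default_run_properties)
print(response['RunId'])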

F Strings and Interpolation using a properties file

I have a simple Python app and I'm trying to combine a bunch of output messages to standardize the output to the user. I've created a properties file for this, and it looks similar to the following:
[migration_prepare]
console=The migration prepare phase failed in {stage_name} with error {error}!
email=The migration prepare phase failed while in {stage_name}. Contact support!
slack=The **_prepare_** phase of the migration failed
I created a method to handle fetching messages from a Properties file... similar to:
from configparser import ConfigParser, NoOptionError, NoSectionError

def get_msg(category, message_key, prop_file_location="messages.properties"):
    """Get a string from a properties file that is utilized similar to a dictionary and be used in subsequent
    messaging between console, slack and email communications"""
    message = None
    config = ConfigParser()
    try:
        dataset = config.read(prop_file_location)
        if len(dataset) == 0:
            raise ValueError("failed to find property file")
        message = config.get(category, message_key).replace('\\n', '\n')  # if contains newline characters i.e. \n
    except NoOptionError as no:
        print(f"Bad option for value {message_key}")
        print(f"{no}")
    except NoSectionError as ns:
        print(f"There is no section in the properties file {prop_file_location} that contains category {category}!")
        print(f"{ns}")
    return f"{message}"
The method returns the f-string fine to the calling class. My question is: if the string in my properties file contains text like {some_value} that I intend to have interpolated in the calling class (the way curly brackets are in an f-string), why does it come back as a string literal? The output is the literal text, not the interpolated value I expect:
What I get: The migration prepare phase failed while in {stage_name} stage. Contact support!
What I would like: The migration prepare phase failed while in Reconciliation stage. Contact support!
I would like the output from the method to be the interpolated value. Has anyone done anything like this?
I am not sure where you define your stage_name, but in order to interpolate in a config file you need to use ${stage_name}.
Interpolation in f-strings and in ConfigParser files is not the same.
Update: added 2 usage examples:
# ${} option using ExtendedInterpolation
from configparser import ConfigParser, ExtendedInterpolation

parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string('[example]\n'
                   'x=1\n'
                   'y=${x}')
print(parser['example']['y'])  # '1'

# another option - %()s (the default BasicInterpolation)
from configparser import ConfigParser

parser = ConfigParser()
parser.read_string('[example]\n'
                   'x=1\n'
                   'y=%(x)s')
print(parser['example']['y'])  # '1'
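Applied to the properties file from the question, that means switching the placeholders to ${...} and supplying the values before the lookup. A sketch, where the stage_name value 'Reconciliation' is purely illustrative:

from configparser import ConfigParser, ExtendedInterpolation

# messages.properties would use ${stage_name} instead of {stage_name}
config = ConfigParser(defaults={'stage_name': 'Reconciliation'},
                      interpolation=ExtendedInterpolation())
config.read_string('[migration_prepare]\n'
                   'email=The migration prepare phase failed while in ${stage_name}. Contact support!\n')
print(config.get('migration_prepare', 'email'))
# The migration prepare phase failed while in Reconciliation. Contact support!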

NameError: name 'countryCodeMap' is not defined

I am trying to implement a Spark program on a Databricks cluster, and I am following this documentation: https://databricks.com/blog/2018/07/09/analyze-games-from-european-soccer-leagues-with-apache-spark-and-databricks.html
Now, after this block of code:
def mapKeyToVal(mapping):
    def mapKeyToVal_(col):
        return mapping.get(col)
    return udf(mapKeyToVal_, StringType())
I am using this:
gameInfDf = gameInfDf.withColumn("country_code", mapKeyToVal(countryCodeMap)("country"))
And I am getting the error: NameError: name 'countryCodeMap' is not defined
It would be great if anyone could help me with this.
https://databricks.com/blog/2018/07/09/analyze-games-from-european-soccer-leagues-with-apache-spark-and-databricks.html is the official guide from Databricks.
You need to click on the link in the blog post and IMPORT the .dbc notebook archive.
You will then see the various setup cells, e.g. the maps that are needed. Good stuff.
Here are some of the maps:
situationMap = {1:'Open play', 2:'Set piece', 3:'Corner', 4:'Free kick', 99:'NA'}
countryCodeMap = {'germany':'DEU', 'france':'FRA', 'england':'GBR', 'spain':'ESP', 'italy':'ITA'}
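In other words, the cell that defines the map has to run before the withColumn call from the question; roughly:

# run the setup cell from the imported notebook first so the map exists
countryCodeMap = {'germany': 'DEU', 'france': 'FRA', 'england': 'GBR',
                  'spain': 'ESP', 'italy': 'ITA'}

# now the UDF factory can close over it
gameInfDf = gameInfDf.withColumn("country_code",
                                 mapKeyToVal(countryCodeMap)("country"))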

How to create a trigger with multiple substitution variables in Google Cloud Build with Python

I am working on Python code to create a Google Cloud Build trigger, but I am not able to add substitution variables.
Currently I have the code below:
from google.cloud.devtools import cloudbuild_v1
client = cloudbuild_v1.CloudBuildClient()
build_trigger_template = cloudbuild_v1.types.BuildTrigger()
build_trigger_template.description = 'test to create trigger'
build_trigger_template.name = 'github-cloudbuild-trigger1'
build_trigger_template.github.name = 'github-cloudbuild'
build_trigger_template.github.pull_request.branch = 'master'
build_trigger_template.filename = 'cloudbuild.yaml'
response = client.create_build_trigger('dev', build_trigger_template)
I want to add two substitution variables, _ENV and _PROJECT. I tried the way shown below, but it is not working:
build_trigger_template.substitutions = {'_ENV': 'test',
                                        '_PROJECT': 'pro-test'}
Error: AttributeError: Assignment not allowed to repeated field "substitutions" in protocol message object.
Thanks,
Raghunath.
This is an issue with assigning to a protobuf object.
If you look at the object using dir(build_trigger_template.substitutions),
you'll find an .update method that accepts a dictionary.
So try the below; it returns None, but your structure will be updated.
build_trigger_template.substitutions.update({'_ENV': 'test',
                                             '_PROJECT': 'pro-test'})
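For example, a quick sketch of the idea using the objects from the question:

# the substitutions map field is mutated in place; .update returns None
build_trigger_template.substitutions.update({'_ENV': 'test',
                                             '_PROJECT': 'pro-test'})
print(dict(build_trigger_template.substitutions))
# {'_ENV': 'test', '_PROJECT': 'pro-test'}

response = client.create_build_trigger('dev', build_trigger_template)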

Creating a custom component in spaCy

I am trying to create a spaCy pipeline component that returns Spans of meaningful text (my corpus comprises PDF documents that have a lot of garbage I am not interested in: tables, headers, etc.)
More specifically I am trying to create a function that:
takes a doc object as an argument
iterates over the doc tokens
when certain rules are met, yields a Span object
Note: I would also be happy with it returning a list ([span_obj1, span_obj2])
What is the best way to do something like this? I am a bit confused on the difference between a pipeline component and an extension attribute.
So far I have tried:
nlp = English()
Doc.set_extension('chunks', method=iQ_chunker)
####
raw_text = get_test_doc()
doc = nlp(raw_text)
print(type(doc._.chunks))
>>> <class 'functools.partial'>
iQ_chunker is a method that does what I explained above, and it returns a list of Span objects.
This is not the result I expect, since the function I pass in as method returns a list.
I imagine you're getting a functools partial back because you are accessing chunks as an attribute, despite having passed it in as an argument for method. If you want spaCy to intervene and call the method for you when you access something as an attribute, it needs to be
Doc.set_extension('chunks', getter=iQ_chunker)
Please see the Doc documentation for more details.
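To illustrate the difference (a sketch against the setup in the question): with method= the extension is exposed as a callable that you invoke yourself, while with getter= plain attribute access runs the function.

# with Doc.set_extension('chunks', method=iQ_chunker):
chunks = doc._.chunks()   # you call it yourself; this runs iQ_chunker(doc) and returns the list
# with Doc.set_extension('chunks', getter=iQ_chunker):
chunks = doc._.chunks     # attribute access triggers iQ_chunker(doc)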
However, if you are planning to compute this attribute for every single document, I think you should make it part of your pipeline instead. Here is some simple sample code that does it both ways.
import spacy
from spacy.tokens import Doc

def chunk_getter(doc):
    # the getter is called when we access _.extension_1,
    # so the computation is done at access time
    # also, because this is a getter,
    # we need to return the actual result of the computation
    first_half = doc[0:len(doc)//2]
    second_half = doc[len(doc)//2:len(doc)]
    return [first_half, second_half]

def write_chunks(doc):
    # this pipeline component is called as part of the spacy pipeline,
    # so the computation is done at parse time
    # because this is a pipeline component,
    # we need to set our attribute value on the doc (which must be registered)
    # and then return the doc itself
    first_half = doc[0:len(doc)//2]
    second_half = doc[len(doc)//2:len(doc)]
    doc._.extension_2 = [first_half, second_half]
    return doc

nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
Doc.set_extension("extension_1", getter=chunk_getter)
Doc.set_extension("extension_2", default=[])
nlp.add_pipe(write_chunks)

test_doc = nlp('I love spaCy')
print(test_doc._.extension_1)
print(test_doc._.extension_2)
This just prints [I, love spaCy] twice because it's two methods of doing the same thing, but I think making it part of your pipeline with nlp.add_pipe is the better way to do it if you expect to need this output on every document you parse.
