Databricks DBX named parameters through job

I am trying to implement the approach mentioned here, where the variables are not hard-coded in the conf file but passed in as named arguments.
When running in local mode with a Python debugger, I can easily pass this as:
Fundingsobj = SomeClass(init_conf={"localmode": "true", "fundingsdatapath": "tmp/fundings"})
Fundingsobj.launch()
where SomeClass inherits from Task.
However, I can't seem to pass this through deployment.yaml; I have tried many variations. This is how I read the values:
class SomeClass(Task):
    """Class containing methods for generating test data."""

    def initialize(self):
        """Initialize method."""
        self.localmode = self.conf["localmode"]
This works fine if I use the normal --conf-file option in deployment.yaml and provide the values there, or use init_conf with the local debugger.
How do I pass variables to the job without relying on a conf file? The idea is that, after the job is deployed in Databricks, I would like to schedule it from Airflow and pass in fresh variable values every day.
Error while launching the job -
EDIT 1:
I have tried to use kwargs, but even that gives me the same error:
named_parameters: {"localmode": "true", "fundingsdatapath": "tmp/fundings"}

and then try to consume it using:

def initialize(self, **kwargs):
    """Initialize method."""
    self.localmode = kwargs["localmode"]

I found the answer: basically, one has to use argparse for this.
So, after defining the task in deployment.yaml as:
- name: "clientscoretestdatageneratorusingparams"
  tasks:
    - task_key: "loadtestdataparams"
      <<:
        - *basic-static-cluster
      libraries:
        - pypi:
            package: someadditionalpkg
            repo: http://internalartifactoryurl
      python_wheel_task:
        package_name: "workflows"
        entry_point: clientscoretestdatagenerator
        named_parameters: {"localmode": "true", "fundingsdatapath": "tmp/fundings"}
then, since each named parameter is passed to the wheel entry point as a --key=value command-line argument, they can be parsed with argparse in the entrypoint method:
from argparse import ArgumentParser

# ... class and its other methods ...

def entrypoint():  # pragma: no cover
    """Entrypoint for spark wheel jobs."""
    parser = ArgumentParser()
    parser.add_argument("--localmode", dest="localmode", default=False)
    parser.add_argument("--fundingsdatapath", dest="fundingsdatapath", default="tmp/fundings")
    parser.add_argument("--datalakename", dest="datalakename", default="datalakename")
    args = parser.parse_args()
    fundingsobj = GenerateClientScoreData()
    fundingsobj.launch(args)
and then consume them using:

def initialize(self, args):
    """Initialize method."""
    self.localmode = args.localmode
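Since the end goal is daily scheduling from Airflow, here is a minimal sketch of triggering the deployed job with fresh values each day. It assumes the apache-airflow-providers-databricks package; the DAG name, job_id, connection id, and dates are hypothetical placeholders, and depending on the Jobs API version the named parameters may need to be sent as python_named_params (or via the raw json payload) instead of python_params:

from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="fundings_daily",          # hypothetical DAG name
    start_date=datetime(2023, 1, 1),  # placeholder
    schedule="@daily",                # Airflow 2.4+ spelling
) as dag:
    run_job = DatabricksRunNowOperator(
        task_id="run_fundings_job",
        databricks_conn_id="databricks_default",
        job_id=12345,                 # hypothetical: id of the deployed job
        python_params=[
            "--localmode", "false",
            "--fundingsdatapath", "tmp/fundings/{{ ds }}",  # templated per run
        ],
    )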

Related

Django initialising AppConfig multiple times

I wanted to use the ready() hook in my AppConfig to start a django-rq scheduler job. However, it does so multiple times, every time I start the server. I imagine that's due to threading, but I can't seem to find a suitable workaround. This is my AppConfig:
class AnalyticsConfig(AppConfig):
    name = 'analytics'

    def ready(self):
        print("Init scheduler")
        from analytics.services import save_hits
        scheduler = django_rq.get_scheduler('analytics')
        scheduler.schedule(datetime.utcnow(), save_hits, interval=5)
Now when I do runserver, Init scheduler is displayed 3 times. I've done some digging, and according to this question I started the server with --noreload, which didn't help (I still got Init scheduler x3). I also tried putting
import os

if os.environ.get('RUN_MAIN', None) != 'true':
    default_app_config = 'analytics.apps.AnalyticsConfig'

in my __init__.py, however RUN_MAIN appears to be None every time.
Afterwards I created a FileLock class to skip configuration after the first initialization, which looks like this:
class FileLock:
    def __get__(self, instance, owner):
        return os.access(f"{instance.__class__.__name__}.lock", os.F_OK)

    def __set__(self, instance, value):
        if not isinstance(value, bool):
            raise AttributeError
        if value:
            f = open(f"{instance.__class__.__name__}.lock", 'w+')
            f.close()
        else:
            os.remove(f"{instance.__class__.__name__}.lock")

    def __delete__(self, obj):
        raise AttributeError


class AnalyticsConfig(AppConfig):
    name = 'analytics'
    locked = FileLock()

    def ready(self):
        from analytics.services import save_hits
        if not self.locked:
            print("Init scheduler")
            scheduler = django_rq.get_scheduler('analytics')
            scheduler.schedule(datetime.utcnow(), save_hits, interval=5)
            self.locked = True
This does work; however, the lock is not destroyed after the app quits. I tried removing the .lock files in settings.py, but that also runs multiple times, making this pointless.
My question is: how can I prevent Django from calling ready() multiple times, or otherwise tear down the .lock files after Django exits or right after it boots?
I'm using Python 3.8 and Django 3.1.5.
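For what it's worth, the RUN_MAIN check from the attempt above is usually placed inside ready() itself rather than in __init__.py. A minimal sketch of that variant, assuming Django's autoreloader (which sets RUN_MAIN="true" only in the serving child process), not a verified fix:

import os
from datetime import datetime

from django.apps import AppConfig


class AnalyticsConfig(AppConfig):
    name = 'analytics'

    def ready(self):
        # Skip the reloader parent process; under the autoreloader only the
        # child that actually serves requests has RUN_MAIN set to "true".
        if os.environ.get('RUN_MAIN') != 'true':
            return
        import django_rq
        from analytics.services import save_hits

        scheduler = django_rq.get_scheduler('analytics')
        scheduler.schedule(datetime.utcnow(), save_hits, interval=5)

One caveat: with --noreload, or under a production server, RUN_MAIN is never set, so this guard would skip scheduling entirely there.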

__post_init__ of python 3.x dataclasses is not called when loaded from yaml

Please note that I have already referred to the StackOverflow question here. I post this question to investigate whether calling __post_init__ manually is safe or not. Please read the question till the end.
Check the code below. In step 3 we load dataclass A from a YAML string; note that this does not call the __post_init__ method.
import dataclasses
import yaml


@dataclasses.dataclass
class A:
    a: int = 55

    def __post_init__(self):
        print("__post_init__ got called", self)


print("\n>>>>>>>>>>>> 1: create dataclass object")
a = A(33)
print(a)  # print dataclass
print(dataclasses.fields(a))

print("\n>>>>>>>>>>>> 2: dump to yaml")
s = yaml.dump(a)
print(s)  # print yaml repr

print("\n>>>>>>>>>>>> 3: create class from str")
a_ = yaml.load(s, Loader=yaml.Loader)  # Loader needed in PyYAML 5.1+ to construct python objects
print(a_)  # print dataclass loaded from yaml str
print(dataclasses.fields(a_))
The solution that I see for now is calling __post_init__ on my own at the end, as in the code snippet below.
a_.__post_init__()
I am not sure whether this is a safe recreation of the YAML-serialized dataclass. It will also pose a problem when __post_init__ takes kwargs, i.e. in the case where dataclass fields are of dataclasses.InitVar type.
This behavior is working as intended. You are dumping an existing object, so when you load it, pyyaml intentionally avoids initializing the object again. The direct attributes of the dumped object are saved even if they were created in __post_init__, because that function runs prior to the dump. When you want the side effects that come from __post_init__, like the print statement in your example, you will need to ensure that initialization occurs.
There are a few ways to accomplish this. You can use either the metaclass approach or the constructor/representer approach described in pyyaml's documentation. You could also manually alter the dumped string in your example to use '!!python/object/new:' instead of '!!python/object:'. If your eventual goal is to have the yaml file generated in a different manner, then this might be a solution.
See below for an update to your code that uses the metaclass approach and calls __post_init__ when loading the dumped object. The call to cls(**fields) in from_yaml ensures that the object is initialized. yaml.load uses cls.__new__ to create objects tagged with '!!python/object:' and then loads the saved attributes into the object manually.
import dataclasses
import yaml


@dataclasses.dataclass
class A(yaml.YAMLObject):
    a: int = 55

    yaml_tag = '!A'
    yaml_loader = yaml.SafeLoader

    def __post_init__(self):
        print("__post_init__ got called", self)

    @classmethod
    def from_yaml(cls, loader, node):
        fields = loader.construct_mapping(node, deep=True)
        return cls(**fields)


print("\n>>>>>>>>>>>> 1: create dataclass object")
a = A(33)
print(a)  # print dataclass
print(dataclasses.fields(a))

print("\n>>>>>>>>>>>> 2: dump to yaml")
s = yaml.dump(a)
print(s)  # print yaml repr

print("\n>>>>>>>>>>>> 3: create class from str")
a_ = yaml.load(s, Loader=A.yaml_loader)
print(a_)  # print dataclass loaded from yaml str
print(dataclasses.fields(a_))
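As a small usage note (an inference from the yaml_loader = yaml.SafeLoader attribute above): because A registers its constructor on SafeLoader, the dumped !A document should also round-trip through yaml.safe_load, with __post_init__ running via from_yaml:

a2 = yaml.safe_load(s)  # resolves the !A tag through from_yaml, so __post_init__ runs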

How to capture the iteration number dynamically inside a test while using pytest-repeat

I'm executing my Selenium script multiple times using pytest-repeat. I need to capture the iteration number during execution and make use of it.
I explored pytest.mark, pytest.collect & pytest.Collector:
class Testone():
    @pytest.fixture()
    def setup(self):
        ...

    @pytest.mark.repeat(RowCount)
    def test_create_eq(self, setup):
        # Need to capture the iteration number here.
        ...
I think there should be an easier and more straightforward way than what I describe below. pytest-repeat has a fixture __pytest_repeat_step_number which I hoped would provide the current step number for the test, but it did not.
request.node.name provides the name of the test function generated by pytest-repeat, and it contains the step number, which can be extracted for your purpose:
import pytest


class Testone():
    @pytest.fixture()
    def setup(self):
        pass

    @pytest.mark.repeat(4)
    def test_create_eq(self, setup, request):
        current_step = request.node.name.split('[')[1].split('-')[0]  # string form; parse to int if required
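If several tests need the step number, the same name-parsing trick could be wrapped in a fixture; a small sketch (the fixture name iteration is made up here, and the [step-count] id format is the one pytest-repeat generates):

import pytest


@pytest.fixture
def iteration(request):
    # pytest-repeat generates test ids like "test_create_eq[3-4]";
    # the first number is the current step
    return int(request.node.name.split('[')[1].split('-')[0])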

Running pytest with thread

I have a question about pytest.
I would like to run the same pytest script with multiple threads. But I am not sure how to create and run a thread that passes more than one param (and how to run threads with pytest at all).
For example, I have test_web.py:
from selenium import webdriver
import pytest


class SAMPLETEST:
    self.browser = webdriver.Chrome()
    self.browser.get(URL)
    self.browser.maximize_window()

    def test_title(self):
        assert "Project WEB" in self.browser.title

    def test_login(self):
        print('Testing Login')
        ID_BOX = self.browser.find_element_by_id("ProjectemployeeId")
        PW_BOX = self.browser.find_element_by_id("projectpassword")
        ID_BOX.send_keys(self.ID)  # this is where the ID goes; this param should come from thread_run.py
        PW_BOX.send_keys(self.PW)  # this is where the PW goes; it is not working, and I am not sure how to get this data from thread_run.py
        PW_BOX.submit()
In thread_run.py:
import threading
import time

from test_web import SAMPLETEST

ID_List = ["0", "1", "2", "3", "4", "5", "6", "7"]
PW_LIST = ["0", "1", "2", "3", "4", "5", "6", "7"]
threads = []

print("1: Create thread")
for I in range(8):
    print("Append thread" + str(I))
    t = threading.Thread(target=SAMPLETEST, args=(ID_List[I], PW_LIST[I]))
    threads.append(t)

for I in range(8):
    print("Start thread:" + str(I))
    threads[I].start()
I was able to run threads over many SAMPLETEST instances without pytest. However, it is not working with pytest.
My question is twofold. First, how do I initialize self.browser inside SAMPLETEST? I am sure the code below will not work:
self.browser = webdriver.Chrome()
self.browser.get(URL)
self.browser.maximize_window()
Second, in thread_run.py, how can I pass the two arguments (ID and password) when I run a thread that calls SAMPLETEST in test_web.py?
ID_BOX.send_keys(self.ID)  # this param should come from thread_run.py
PW_BOX.send_keys(self.PW)
I tried to build a constructor (__init__) in the SAMPLETEST class, but it wasn't working. I am not really sure how to run threads that pass arguments or parameters with pytest.
There are two scenarios I can read from this:
1. Prepare the test data and pass parameters into your test method, which can be achieved with the pytest_generate_tests hook and the parametrize concept (a sketch follows this list). You can refer to the documentation here.
2. For running pytest in multiple threads: pytest-xdist or pytest-parallel.
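A minimal sketch of the first option, with a placeholder URL and the element ids taken from the question; each (ID, PW) pair becomes its own test, and pytest-xdist can then fan them out across workers with pytest -n 8:

# test_web.py: a sketch, not the poster's verified code
import pytest
from selenium import webdriver

CREDENTIALS = [(str(i), str(i)) for i in range(8)]  # (ID, PW) pairs


@pytest.mark.parametrize("user_id,password", CREDENTIALS)
def test_login(user_id, password):
    browser = webdriver.Chrome()
    try:
        browser.get("http://example.com")  # placeholder URL
        browser.maximize_window()
        id_box = browser.find_element_by_id("ProjectemployeeId")
        pw_box = browser.find_element_by_id("projectpassword")
        id_box.send_keys(user_id)
        pw_box.send_keys(password)
        pw_box.submit()
    finally:
        browser.quit()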
I had a similar issue and got it resolved by passing the argument in the form of a list. I replaced the line below:
thread_1 = Thread(target=fun1, args=10)
with
thread_1 = Thread(target=fun1, args=[10])
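As a general threading detail (not specific to pytest): args must be an iterable of positional arguments, so a one-element tuple works just as well:

thread_1 = Thread(target=fun1, args=(10,))  # the trailing comma makes it a tuple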

python 3: mock a method of the boto3 S3 client

I want to unit test some code that calls a method of the boto3 s3 client.
I can't use moto because this particular method (put_bucket_lifecycle_configuration) is not yet implemented in moto.
I want to mock the S3 client and assert that this method was called with specific parameters.
The code I want to test looks something like this:
# sut.py
import boto3


class S3Bucket(object):
    def __init__(self, name, lifecycle_config):
        self.name = name
        self.lifecycle_config = lifecycle_config

    def create(self):
        client = boto3.client("s3")
        client.create_bucket(Bucket=self.name)
        rules = ...  # some code that computes rules from self.lifecycle_config
        # I want to test that `rules` is correct in the following call:
        client.put_bucket_lifecycle_configuration(
            Bucket=self.name,
            LifecycleConfiguration={"Rules": rules})


def create_a_bucket(name):
    lifecycle_policy = ...  # a dict with a bunch of key/value pairs
    bucket = S3Bucket(name, lifecycle_policy)
    bucket.create()
    return bucket
In my test, I'd like to call create_a_bucket() (though instantiating an S3Bucket directly is also an option) and make sure that the call to put_bucket_lifecycle_configuration was made with the correct parameters.
I have messed around with unittest.mock and botocore.stub.Stubber but have not managed to crack this. Unless otherwise urged, I am not posting my attempts since they have not been successful so far.
I am open to suggestions on restructuring the code I'm trying to test in order to make it easier to test.
I got the test to work with the following, where ... is the remainder of the arguments that are expected to be passed to s3.put_bucket_lifecycle_configuration().
# test.py
from unittest.mock import patch
import unittest

import sut


class MyTestCase(unittest.TestCase):
    @patch("sut.boto3")
    def test_lifecycle_config(self, cli):
        s3 = cli.client.return_value  # the mock client that create() gets from boto3.client("s3")
        sut.create_a_bucket("foo")
        s3.put_bucket_lifecycle_configuration.assert_called_once_with(Bucket="foo", ...)


if __name__ == '__main__':
    unittest.main()
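One detail worth calling out: the patch target is "sut.boto3" (the name boto3 is bound to inside the module under test), so cli.client.return_value is exactly the mock that create() receives from boto3.client("s3"); that is what makes the assert_called_once_with check work.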
