How do I get the name of a flow from within a task in Prefect?

I'm not sure if this is possible, since flow names are assigned later when a flow is actually run (aka, "creepy-lemur" or whatnot), but I'd like to define a Prefect task within a flow and have that task collect the name of the flow that ran it, so I can insert it into a database table. Has anyone figured out how to do this?

You can get the flow run name and ID from the context:
import prefect
from prefect import task, flow
@task
def print_task_context():
    print("Task run context:")
    print(prefect.context.get_run_context().task_run.dict())

@flow
def main_flow():
    print_task_context()
    print("Flow run context:")
    print(prefect.context.get_run_context().flow_run.dict())

if __name__ == "__main__":
    main_flow()
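If what you want is specifically the generated flow run name (e.g. "creepy-lemur") from inside a task so you can write it to a database, here is a minimal sketch, assuming a recent Prefect 2.x release where the prefect.runtime module is available:
from prefect import flow, task
from prefect.runtime import flow_run  # available in recent Prefect 2.x releases

@task
def record_flow_run_name():
    # flow_run.name is the generated run name, e.g. "creepy-lemur"
    run_name = flow_run.name
    print(f"This task belongs to flow run: {run_name}")
    # insert run_name into your database table here
    return run_name

@flow
def main_flow():
    record_flow_run_name()

if __name__ == "__main__":
    main_flow()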
Here are more resources on Prefect Discourse about setting custom run names:
https://discourse.prefect.io/tag/task_run_name
https://discourse.prefect.io/tag/flow_run_name
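Those threads cover naming the runs yourself; for reference, a small hedged example of that idea (assuming Prefect 2.x, where flow_run_name and task_run_name can be templated from parameters):
from prefect import flow, task

@task(task_run_name="load-{table}")
def load(table: str):
    print(f"loading {table}")

@flow(flow_run_name="nightly-load-{env}")
def nightly_load(env: str = "prod"):
    load(table="customers")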

Related

How to launch a Cloud Dataflow pipeline from a Google Cloud Function when a particular set of files reaches Cloud Storage

I have a requirement to create a Cloud Function which should check for a set of files in a GCS bucket, and only launch the Dataflow templates for those files once all of them have arrived in the bucket.
My existing Cloud Function code launches a Dataflow job for each file that arrives in the GCS bucket. It runs different Dataflow jobs for different files based on a naming convention. This existing code works fine, but my intention is not to trigger Dataflow for each uploaded file directly.
It should check for the set of files, and only once all of the files have arrived should it launch the Dataflow jobs for them.
Is there a way to do this using Cloud Functions, or is there an alternative way of achieving the desired result?
from googleapiclient.discovery import build
import time

def df_load_function(file, context):
    filesnames = [
        'Customer_',
        'Customer_Address',
        'Customer_service_ticket'
    ]

    # Check the uploaded file and run related dataflow jobs.
    for i in filesnames:
        if 'inbound/{}'.format(i) in file['name']:
            print("Processing file: {filename}".format(filename=file['name']))

            project = 'xxx'
            inputfile = 'gs://xxx/inbound/' + file['name']
            job = 'df_load_wave1_{}'.format(i)
            template = 'gs://xxx/template/df_load_wave1_{}'.format(i)
            location = 'asia-south1'

            dataflow = build('dataflow', 'v1b3', cache_discovery=False)
            request = dataflow.projects().locations().templates().launch(
                projectId=project,
                gcsPath=template,
                location=location,
                body={
                    'jobName': job,
                    "environment": {
                        "workerRegion": "asia-south1",
                        "tempLocation": "gs://xxx/temp"
                    }
                }
            )

            # Execute the dataflow job
            response = request.execute()
            job_id = response["job"]["id"]
I've written the code below for the above functionality. The Cloud Function runs without any error, but it does not trigger any Dataflow job. I'm not sure what is happening, as the logs show no errors.
from googleapiclient.discovery import build
import time
import os

def df_load_function(file, context):
    filesnames = [
        'Customer_',
        'Customer_Address_',
        'Customer_service_ticket_'
    ]

    paths = ['Customer_', 'Customer_Address_', 'Customer_service_ticket_']
    for path in paths:
        if os.path.exists('gs://xxx/inbound/') == True:
            # Check the uploaded file and run related dataflow jobs.
            for i in filesnames:
                if 'inbound/{}'.format(i) in file['name']:
                    print("Processing file: {filename}".format(filename=file['name']))

                    project = 'xxx'
                    inputfile = 'gs://xxx/inbound/' + file['name']
                    job = 'df_load_wave1_{}'.format(i)
                    template = 'gs://xxx/template/df_load_wave1_{}'.format(i)
                    location = 'asia-south1'

                    dataflow = build('dataflow', 'v1b3', cache_discovery=False)
                    request = dataflow.projects().locations().templates().launch(
                        projectId=project,
                        gcsPath=template,
                        location=location,
                        body={
                            'jobName': job,
                            "environment": {
                                "workerRegion": "asia-south1",
                                "tempLocation": "gs://xxx/temp"
                            }
                        }
                    )

                    # Execute the dataflow job
                    response = request.execute()
                    job_id = response["job"]["id"]
        else:
            exit()
Could someone please help me with the above Python code?
Also, my file names contain the current date at the end, as these are incremental files which I get from different source teams.
If I'm understanding your question correctly, the easiest thing to do is to write basic logic in your function that determines if the entire set of files is present. If not, exit the function. If yes, run the appropriate Dataflow pipeline. Basically implementing what you wrote in your first paragraph as Python code.
If it's a small set of files, it shouldn't be an issue to have a function run on each upload to check set completeness. Even if it's, for example, 10,000 files a month, the cost of this service is extremely small assuming:
Your function isn't using lots of bandwidth to transfer data
The code for each function invocation doesn't take a long time to run.
Even in scenarios where you can't meet these requirements, Functions is still pretty cheap to run.
If you're worried about costs I would recommend checking out the Google Cloud Pricing Calculator to get an estimate.
Edit with updated code:
I would highly recommend using the Google Cloud Storage Python client library for this. Using os.path won't work, because a gs:// URI is not a local filesystem path; the bucket has to be searched through the Storage API.
To use the Python client library, add google-cloud-storage to your requirements.txt. Then, use something like the following code to check the existence of an object. This example is based off an HTTP trigger, but the gist of the code to check object existence is the same.
from google.cloud import storage

def hello_world(request):
    # Instantiate GCS client
    client = storage.Client()

    # Instantiate bucket definition
    bucket = client.bucket("bucket-name")

    # Search for each expected object (filenames is your list of expected object names)
    for file in filenames:
        blob = bucket.blob(file)
        if blob.exists() and "name_modifier" in file:
            # Run name_modifier Dataflow job
            pass
        elif blob.exists() and "name_modifier_2" in file:
            # Run name_modifier_2 Dataflow job
            pass
        else:
            return "File not found"
This code isn't exactly what you want from a logic standpoint, but it should get you started. You'll probably want to first make sure all of the objects can be found, and then move to another step where you run the corresponding Dataflow job for each file if they were all found in the previous step.
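To tie that together with the Dataflow launch code from the question, here is a minimal sketch of that two-step logic (check set completeness first, then launch), reusing the bucket, template paths and region from the question; the prefixes and project name are placeholders to adjust:
from google.cloud import storage
from googleapiclient.discovery import build

# Placeholder values; adjust bucket, prefixes, project and template paths to your setup.
EXPECTED_PREFIXES = ["Customer_", "Customer_Address_", "Customer_service_ticket_"]
BUCKET = "xxx"
PROJECT = "xxx"

def all_files_present(client):
    # True only if every expected prefix has at least one object under inbound/.
    names = [blob.name for blob in client.list_blobs(BUCKET, prefix="inbound/")]
    return all(any(n.startswith("inbound/" + p) for n in names) for p in EXPECTED_PREFIXES)

def df_load_function(file, context):
    client = storage.Client()
    if not all_files_present(client):
        print("Set incomplete, waiting for remaining files.")
        return

    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    for prefix in EXPECTED_PREFIXES:
        request = dataflow.projects().locations().templates().launch(
            projectId=PROJECT,
            gcsPath="gs://xxx/template/df_load_wave1_{}".format(prefix),
            location="asia-south1",
            body={
                "jobName": "df_load_wave1_{}".format(prefix),
                "environment": {
                    "workerRegion": "asia-south1",
                    "tempLocation": "gs://xxx/temp"
                }
            }
        )
        response = request.execute()
        print("Launched job:", response["job"]["id"])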

Group together celery results

TL;DR
I want to label results in the backend.
I have a Flask/Celery project and I'm new to Celery.
A user sends in a batch of tasks for Celery to work on.
Celery saves the results to a backend SQL database (a table automatically created by Celery, named celery_taskmeta).
I want to let the user see the status of his batch and request the results from the backend.
My problem is that all the results are in one table. What are my options to label this batch, so the user can differentiate the batches?
My ideas:
Can I add a label to each task, e.g. "Bob's batch no. 12", and then query celery_taskmeta for that?
Can I put each batch in a named backend table, i.e. ask Celery to save results to a table named task_12?
Trying with groups
I've tried the following code to group the results
job_group = group(api_get.delay(url) for url in urllist)
But I don't see any way to identify the group in the backend/results DB
Trying with task name
In the backend I see an empty column header 'name', so I thought I could add an arbitrary string there:
@app.task(name="an amazing vegetable")
def api_get(url: str) -> tuple:
    ...
But then the celery worker throws an error when I run the task:
KeyError: 'an amazing vegetable'
[2020-12-18 12:07:22,713: ERROR/MainProcess] Received unregistered task of type 'an amazing vegetable'.
Probably the simplest solution is to use Group and use the Group Result to periodically poll for group state.
A1: As for the label question - yes, you can "label" your task by using the custom state feature.
A2: you could hack around to put each batch of tasks in its own backend table, but I strongly advise against messing with it. If you really want to go this route, use a separate database for this particular purpose.
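A sketch of the GroupResult polling approach from the first suggestion, assuming a result backend is configured and that api_get and urllist are the task and URL list from the question:
from celery import group
from celery.result import GroupResult

# Build the batch with signatures (.s), not .delay, so group() dispatches them itself.
job = group(api_get.s(url) for url in urllist)
result = job.apply_async()
result.save()            # persist the group in the result backend
batch_id = result.id     # hand this id back to the user, e.g. mapped to "Bob's batch no. 12"

# Later, e.g. in a Flask view, restore the group by id and report its state:
restored = GroupResult.restore(batch_id)
print(restored.completed_count(), "of", len(restored), "tasks finished")
if restored.ready():
    print(restored.get())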

Adding records into table while running the db init

I am using Flask, SQLAlchemy, SQLite and Python for my application. When I run db init to create the database, I want a default set of values to be added to the database. I have tried two things to add the records into the table. One method is using an 'event':
from sqlalchemy.event import listen
from sqlalchemy import event, DDL

@event.listens_for(studentStatus.__table__, 'after_create')
def insert_initial_values(*args, **kwargs):
    db.session.add(studentStatus(status_name='Waiting on admission'))
    db.session.add(studentStatus(status_name='Waiting on student'))
    db.session.add(studentStatus(status_name='Interaction Initiated'))
    db.session.commit()
When I run
python manage_db.py db init,
python manage_db.py db migrate,
python manage_db.py db upgrade
I don't get any errors, but the records are not created.
The other method I tried: in models.py I included
record_for_student_status = studentStatus(status_name="Waiting on student")
db.session.add(record_for_student_status)
db.session.commit()
print(record_for_student_status)
The Class Model code:
class StudentStatus(db.Model):
    status_id = db.Column(db.Integer, primary_key=True)
    status_name = db.Column(db.String)

    def __repr__(self):
        return f"studentStatus('{self.status_id}','{self.status_name}')"
When I run python manage_db.py db init, I get an error: Student_status, there is no such table.
Can someone help me add the default values to the student_status table when I run db init?
I have also tried flask-seeder. I installed Flask-Seeder, added a new file called seeds.py and ran flask seed run.
My seeds.py looks like this:
class studentStatus(db.Model):
    def __init__(self, status_id=None, status_name=None):
        self.status_id = db.Column(db.Integer, primary_key=True)
        self.status_name = db.Column(db.String, status_name)

    def __str__(self):
        return "ID=%d, Name=%s" % (self.status_id, self.status_name)

class DemoSeeder(Seeder):
    # run() will be called by Flask-Seeder
    def run(self):
        # Create a new Faker and tell it how to create User objects
        faker = Faker(
            cls=studentStatus,
            init={
                "status_id": 1,
                "name": "Waiting on Admission"
            }
        )

        # Create 5 users
        for user in faker.create(5):
            print("Adding user: %s" % user)
            self.db.session.add(user)
I ran db init, db migrate and db upgrade without any issues. But when I ran flask seed run I got this error:
Error: Could not locate a Flask application. You did not provide the "FLASK_APP" environment variable, and a "wsgi.py" or "app.py" module was not found in the current directory
I googled it and tried
export FLASK_APP=seeds.py
Then I ran flask seed run again and got the error: could not import seeds.py
Please help me with this.
What I finally need is that when I do the db initialization for the first time, some default values are added to the database.
Before running the Flask application you should set the FLASK_APP and FLASK_ENV environment variables.
First, create a function for adding the data. Assume the function is named run, like yours:
def run(self):
    # Create a new Faker and tell it how to create User objects
    faker = Faker(
        cls=studentStatus,
        init={
            "status_id": 1,
            "name": "Waiting on Admission"
        }
    )

    # Create 5 users
    for user in faker.create(5):
        print("Adding user: %s" % user)
        self.db.session.add(user)
Now you can call this function in the file that runs the app. Assume that file is named app.py.
from application import app

if __name__ == '__main__':
    run()
    app.run(debug=True)
Now set the FLASK_APP and FLASK_ENV like below.
export FLASK_APP=app.py
export FLASK_ENV=development
After that, you can run the below commands.
flask db init
flask db migrate
flask db upgrade
This will create all the tables and the data you want.
For more details, see the link below to a GitHub repository where I add some data when the app initializes. Just read the README.md to get some more insight into setting up FLASK_APP.
Flask Data Adding when DB instantiating
Hmm, it seems like you put an underscore in Student_status; try removing it. Another thing: write your class name in CamelCase, which is the more Pythonic way; SQLAlchemy will automatically map that class name to a student_status table.
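On the original after_create approach from the question: here is a minimal sketch of that listener done with the connection that created the table, using the StudentStatus model from the question, and assuming the table is created via db.create_all() (Alembic-generated migrations create tables through their own operations, so with flask db upgrade you may prefer putting the inserts in the migration script itself):
from sqlalchemy import event

@event.listens_for(StudentStatus.__table__, "after_create")
def insert_initial_values(target, connection, **kwargs):
    # Use the same connection that created the table instead of db.session,
    # so the rows are inserted as part of table creation.
    connection.execute(
        target.insert(),
        [
            {"status_name": "Waiting on admission"},
            {"status_name": "Waiting on student"},
            {"status_name": "Interaction Initiated"},
        ],
    )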

Is there any way or workaround to schedule Amazon Mechanical Turk HITs?

I need a specific HIT to run every Friday morning. Is there any way to do this, or any workaround with an external platform (IFTTT and Zapier both don't work)? It seems to me like a very fundamental feature.
FWIW, I figured out how to use Zapier with MTurk. If you are on a paid plan you can leverage the AWS Lambda app to trigger some code that will create a HIT on MTurk. To do this you need an AWS account that's linked to your MTurk account. Once you have that you can create a Lambda function that contains the following code for creating a HIT on MTurk:
import json
import boto3

def lambda_handler(event, context):
    print(event)

    ###################################
    # Step 1: Create a client
    ###################################
    endpoint = "https://mturk-requester.us-east-1.amazonaws.com"
    mturk = boto3.client(
        service_name='mturk',
        region_name='us-east-1',
        endpoint_url=endpoint)

    ###################################
    # Step 2: Define the task
    ###################################
    html = '''
    <**********************************
    My task HTML
    ***********************************>
    '''.format(event['<my parameter>'])

    question_xml = '''
    <HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
    <HTMLContent><![CDATA[{}]]></HTMLContent>
    <FrameHeight>0</FrameHeight>
    </HTMLQuestion>'''.format(html)

    task_attributes = {
        'MaxAssignments': 3,
        'LifetimeInSeconds': 60 * 60 * 5,        # Stay active for 5 hours
        'AssignmentDurationInSeconds': 60 * 10,  # Workers have 10 minutes to respond
        'Reward': '0.03',
        'Title': '<Task title>',
        'Keywords': '<keywords>',
        'Description': '<Task description>'
    }

    ###################################
    # Step 3: Create the HIT
    ###################################
    response = mturk.create_hit(
        **task_attributes,
        Question=question_xml
    )
    hit_type_id = response['HIT']['HITTypeId']
    print('Created HIT {} in HITType {}'.format(response['HIT']['HITId'], hit_type_id))
Note that you'll need to give the role your Lambda uses access to MTurk. From there you can create an IAM user for Zapier to use when calling your Lambda and link it to your Zapier account. Now you can set up your Action to call that Lambda function with whatever parameters you want to pass in the event.
If you want to get the results of the HIT back into your Zap it will be more complicated because Zapier isn't well suited to the asynchronous nature of MTurk HITs. I've put together a blog post on how to do this below:
https://www.daveschultzconsulting.com/2019/07/18/using-mturk-with-zapier/
There is no built-in feature in the MTurk API to accomplish scheduled launch of HITs. It must be done through custom programming.
If you are looking for a turn-key solution, scheduling can be done via TurkPrime using the Scheduled Launch Time found in tab 5 (Setup HIT and Payments).
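For the custom-programming route, one common pattern is to put the HIT-creating code in a Lambda like the one above and trigger it on a cron schedule with EventBridge (CloudWatch Events). A minimal sketch, assuming that Lambda already exists; the rule name, ARN and schedule are placeholders:
import boto3

# Placeholder names/ARNs for illustration.
RULE_NAME = "create-mturk-hit-friday"
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:create_mturk_hit"

events = boto3.client("events", region_name="us-east-1")

# Fire every Friday at 12:00 UTC
# (EventBridge cron fields: minutes hours day-of-month month day-of-week year).
events.put_rule(Name=RULE_NAME, ScheduleExpression="cron(0 12 ? * FRI *)")
events.put_targets(Rule=RULE_NAME, Targets=[{"Id": "1", "Arn": LAMBDA_ARN}])

# The Lambda also needs resource-based permission so the rule can invoke it, e.g. via
# boto3.client("lambda").add_permission(..., Principal="events.amazonaws.com", SourceArn=<rule ARN>)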

Jenkins: How to get a user's LDAP groups in a Groovy script

I have set up a parameterized job for self-service deployments in Jenkins.
Users can select a version of the application and the environment to deploy to.
The available environments displayed to the user are currently just a static list of strings (choice parameter).
Now I want to restrict deployments to some environments based on the LDAP groups of the current user.
The user page in Jenkins displays something like:
Jenkins User Id: maku
Groups:
adm_proj_a
nexus_admin
ROLE_ADM_PROJ_XY
ROLE_BH_KK
How do I get these groups within a Groovy script?
I tried to use the dynamic choice parameter (Scriptler) and to get the LDAP groups using a Groovy script, but did not find my way through the Jenkins API.
Any hints welcome.
User.getAuthorities() requires the caller to have the ADMINISTER permission. (http://javadoc.jenkins-ci.org/hudson/model/User.html#getAuthorities())
An alternative is to query the SecurityRealm directly.
import hudson.model.*
import jenkins.model.*

def userid = User.current().id
def auths = Jenkins.instance.securityRealm.loadUserByUsername(userid)
                   .authorities.collect{ a -> a.authority }

if("adm_proj_a" in auths){
    ...
I found a solution. Just in case anybody is interested:
Within Scriptler I created a Groovy script similar to this:
import hudson.model.*

def allowed_environments = ["dev", "test", "test-integration"]

if ("adm_proj_a" in User.current().getAuthorities())
{
    allowed_environments.add("production")
}

return allowed_environments
This script is used by the dynamic choice parameter (Scriptler) within my Jenkins job.
Now only users within the group adm_proj_a can see production as a choice.
As ffghfgh wrote, the getAuthorities method requires administrator permission. Use the following:
def auth = hudson.model.User.current().impersonate().getAuthorities().collect {it.getAuthority()}
if ("adm_proj_a" in auth){
// do something
}
Jenkins may ask an admin account to approve the script in the "scriptApproval" section.
