Is there any way or workaround to schedule Amazon Mechanical Turk HITs?

I need a specific HIT to run every Friday morning. Is there any way, or any workaround with an external platform, to do this? (IFTTT and Zapier both don't work.) It seems to me like a very fundamental feature.

FWIW, I figured out how to use Zapier with MTurk. If you are on a paid plan you can leverage the AWS Lambda app to trigger some code that will create a HIT on MTurk. To do this you need an AWS account that's linked to your MTurk account. Once you have that you can create a Lambda function that contains the following code for creating a HIT on MTurk:
import json
import boto3

def lambda_handler(event, context):
    print(event)

    ###################################
    # Step 1: Create a client
    ###################################
    endpoint = "https://mturk-requester.us-east-1.amazonaws.com"
    mturk = boto3.client(
        service_name='mturk',
        region_name='us-east-1',
        endpoint_url=endpoint)

    ###################################
    # Step 2: Define the task
    ###################################
    html = '''
    <**********************************
    My task HTML
    ***********************************>
    '''.format(event['<my parameter>'])
    question_xml = '''
    <HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
      <HTMLContent><![CDATA[{}]]></HTMLContent>
      <FrameHeight>0</FrameHeight>
    </HTMLQuestion>'''.format(html)
    task_attributes = {
        'MaxAssignments': 3,
        'LifetimeInSeconds': 60 * 60 * 5,        # Stay active for 5 hours
        'AssignmentDurationInSeconds': 60 * 10,  # Workers have 10 minutes to respond
        'Reward': '0.03',
        'Title': '<Task title>',
        'Keywords': '<keywords>',
        'Description': '<Task description>'
    }

    ###################################
    # Step 3: Create the HIT
    ###################################
    response = mturk.create_hit(
        **task_attributes,
        Question=question_xml
    )
    hit_type_id = response['HIT']['HITTypeId']
    print('Created HIT {} in HITType {}'.format(response['HIT']['HITId'], hit_type_id))
Note you'll need to give the role your Lambda uses access to MTurk. From there you can create an IAM user for Zapier to use when calling your Lambda and link it to your Zapier account. Now you can set up your Zapier Action to call that Lambda function with whatever parameters you want to pass in the event.
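For reference, here is a minimal sketch (my own, not part of the original answer) of calling such a Lambda from Python with a test event, roughly the way the Zapier AWS Lambda action would invoke it; the function name and event key are placeholders:
import json
import boto3

# Hypothetical function name and event payload; adjust to your own Lambda.
lambda_client = boto3.client('lambda', region_name='us-east-1')
response = lambda_client.invoke(
    FunctionName='create-mturk-hit',
    InvocationType='Event',  # asynchronous invocation
    Payload=json.dumps({'<my parameter>': 'value passed from the Zap'}),
)
print(response['StatusCode'])  # 202 means the async invocation was accepted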
If you want to get the results of the HIT back into your Zap it will be more complicated because Zapier isn't well suited to the asynchronous nature of MTurk HITs. I've put together a blog post on how to do this below:
https://www.daveschultzconsulting.com/2019/07/18/using-mturk-with-zapier/

There is no built-in feature in the MTurk API for scheduling the launch of HITs; it must be done through custom programming.
If you are looking for a turn-key solution, scheduling can be done via TurkPrime using the Scheduled Launch Time option found in tab 5 (Setup Hit and Payments).
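If you prefer to roll your own custom programming, one option (a sketch of my own, not from either answer above) is to put the Lambda from the earlier answer on a weekly EventBridge cron rule so it fires every Friday morning; the rule name, function name, ARN, and schedule below are all assumptions:
import boto3

# Sketch: trigger the HIT-creating Lambda every Friday at 09:00 UTC via EventBridge.
# All names and ARNs are placeholders.
events = boto3.client('events', region_name='us-east-1')
lambda_client = boto3.client('lambda', region_name='us-east-1')

rule_arn = events.put_rule(
    Name='create-hit-every-friday',
    ScheduleExpression='cron(0 9 ? * FRI *)',  # every Friday, 09:00 UTC
    State='ENABLED',
)['RuleArn']

# Allow EventBridge to invoke the Lambda
lambda_client.add_permission(
    FunctionName='create-mturk-hit',
    StatementId='allow-friday-rule',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn,
)

# Point the rule at the Lambda and pass a fixed event payload
events.put_targets(
    Rule='create-hit-every-friday',
    Targets=[{
        'Id': 'create-mturk-hit-target',
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:create-mturk-hit',
        'Input': '{"<my parameter>": "weekly run"}',
    }],
)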

Related

Is it possible to get the full history of recurring Google Tasks with the Google Tasks API?

I have a task list with repeating tasks inside. I managed to retrieve the tasks via the Google Tasks API, but only for the current day. From the documentation it looks impossible to retrieve the full history of recurring tasks. Am I missing something?
The idea is to create a script that keeps a count of how successful a person is at doing their dailies, tracks a streak, and maybe makes a chart of the ratio of completed/not completed, etc.
I used the Google quickstart code and modified it a bit, but as much as I try, I can only get the recurring tasks from that day. Any help?
# Call the Tasks API
results = service.tasks().list(tasklist='b09LaUd1MzUzY3RrOGlSUg', showCompleted=True,
                               showDeleted=False, showHidden=True).execute()
items = results.get('items', [])
if not items:
    print('No task lists found.')
    return
print('Task lists:')
for item in items:
    print(item)
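One thing worth ruling out (a sketch of my own, not a confirmed fix for recurring-task history): the snippet above only fetches a single page of results, so paging through all pages with pageToken and a larger maxResults at least ensures completed/hidden items aren't being dropped by pagination:
# Page through all tasks in the list instead of reading only the first page.
tasks = []
page_token = None
while True:
    results = service.tasks().list(
        tasklist='b09LaUd1MzUzY3RrOGlSUg',
        showCompleted=True,
        showDeleted=False,
        showHidden=True,
        maxResults=100,        # default page size is much smaller
        pageToken=page_token,
    ).execute()
    tasks.extend(results.get('items', []))
    page_token = results.get('nextPageToken')
    if not page_token:
        break
print('Fetched {} tasks in total'.format(len(tasks)))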

How do I get the name of a flow from within a task in Prefect?

I'm not sure if this is possible, since flow names are assigned later when a flow is actually run (aka, "creepy-lemur" or whatnot), but I'd like to define a Prefect task within a flow and have that task collect the name of the flow that ran it, so I can insert it into a database table. Has anyone figured out how to do this?
You can get the flow run name and ID from the context:
import prefect
from prefect import task, flow

@task
def print_task_context():
    print("Task run context:")
    print(prefect.context.get_run_context().task_run.dict())

@flow
def main_flow():
    print_task_context()
    print("Flow run context:")
    print(prefect.context.get_run_context().flow_run.dict())

if __name__ == "__main__":
    main_flow()
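If what you actually want inside the task is the generated flow run name (e.g. "creepy-lemur"), one simple pattern (a sketch using the same Prefect 2 context API as above) is to read it in the flow and hand it to the task as a parameter, which the task can then write to your database:
import prefect
from prefect import task, flow

@task
def record_flow_run_name(flow_run_name: str):
    # Insert flow_run_name into your database table here
    print(f"This task was launched by flow run: {flow_run_name}")

@flow
def main_flow():
    run_name = prefect.context.get_run_context().flow_run.name
    record_flow_run_name(run_name)

if __name__ == "__main__":
    main_flow()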
Here are more resources on Prefect Discourse about setting custom run names:
https://discourse.prefect.io/tag/task_run_name
https://discourse.prefect.io/tag/flow_run_name

How to launch a Cloud Dataflow pipeline from a Google Cloud Function when a particular set of files reaches Cloud Storage

I have a requirement to create a Cloud Function which should check for a set of files in a GCS bucket, and only if all of those files arrive in the bucket should it launch the Dataflow templates for all those files.
My existing Cloud Function code launches a Dataflow job for each file that comes into the GCS bucket. It runs different Dataflow jobs for different files based on a naming convention. This existing code works fine, but my intention is not to trigger Dataflow for each uploaded file directly.
It should check for the set of files and, only if all the files have arrived, launch the Dataflow jobs for those files.
Is there a way to do this using Cloud Functions, or is there an alternative way of achieving the desired result?
from googleapiclient.discovery import build
import time

def df_load_function(file, context):
    filesnames = [
        'Customer_',
        'Customer_Address',
        'Customer_service_ticket'
    ]

    # Check the uploaded file and run related dataflow jobs.
    for i in filesnames:
        if 'inbound/{}'.format(i) in file['name']:
            print("Processing file: {filename}".format(filename=file['name']))

            project = 'xxx'
            inputfile = 'gs://xxx/inbound/' + file['name']
            job = 'df_load_wave1_{}'.format(i)
            template = 'gs://xxx/template/df_load_wave1_{}'.format(i)
            location = 'asia-south1'

            dataflow = build('dataflow', 'v1b3', cache_discovery=False)
            request = dataflow.projects().locations().templates().launch(
                projectId=project,
                gcsPath=template,
                location=location,
                body={
                    'jobName': job,
                    "environment": {
                        "workerRegion": "asia-south1",
                        "tempLocation": "gs://xxx/temp"
                    }
                }
            )

            # Execute the dataflow job
            response = request.execute()
            job_id = response["job"]["id"]
I've written the below code for the above functionality. The Cloud Function runs without any error, but it is not triggering any Dataflow job. I'm not sure what is happening, as the logs show no errors.
from googleapiclient.discovery import build
import time
import os

def df_load_function(file, context):
    filesnames = [
        'Customer_',
        'Customer_Address_',
        'Customer_service_ticket_'
    ]
    paths = ['Customer_', 'Customer_Address_', 'Customer_service_ticket_']
    for path in paths:
        if os.path.exists('gs://xxx/inbound/') == True:
            # Check the uploaded file and run related dataflow jobs.
            for i in filesnames:
                if 'inbound/{}'.format(i) in file['name']:
                    print("Processing file: {filename}".format(filename=file['name']))

                    project = 'xxx'
                    inputfile = 'gs://xxx/inbound/' + file['name']
                    job = 'df_load_wave1_{}'.format(i)
                    template = 'gs://xxx/template/df_load_wave1_{}'.format(i)
                    location = 'asia-south1'

                    dataflow = build('dataflow', 'v1b3', cache_discovery=False)
                    request = dataflow.projects().locations().templates().launch(
                        projectId=project,
                        gcsPath=template,
                        location=location,
                        body={
                            'jobName': job,
                            "environment": {
                                "workerRegion": "asia-south1",
                                "tempLocation": "gs://xxx/temp"
                            }
                        }
                    )

                    # Execute the dataflow job
                    response = request.execute()
                    job_id = response["job"]["id"]
        else:
            exit()
Could someone please help me with the above Python code?
Also, my file names contain the current date at the end, as these are incremental files which I get from different source teams.
If I'm understanding your question correctly, the easiest thing to do is to write basic logic in your function that determines if the entire set of files is present. If not, exit the function. If yes, run the appropriate Dataflow pipeline. Basically implementing what you wrote in your first paragraph as Python code.
If it's a small set of files it shouldn't be an issue to have a function run on each upload to check set completeness. Even if it's, for example, 10,000 files a month the cost is extremely small for this service assuming:
Your function isn't using lots of bandwidth to transfer data
The code for each function invocation doesn't take a long time to run.
Even in scenarios where you can't meet these requirements, Cloud Functions is still pretty cheap to run.
If you're worried about costs I would recommend checking out the Google Cloud Pricing Calculator to get an estimate.
Edit with updated code:
I would highly recommend using the Google Cloud Storage Python client library for this. Using os.path likely won't work as there are additional underlying steps required to search a bucket...and probably more technical details there than I fully understand.
To use the Python client library, add google-cloud-storage to your requirements.txt. Then, use something like the following code to check the existence of an object. This example is based off an HTTP trigger, but the gist of the code to check object existence is the same.
from google.cloud import storage

def hello_world(request):
    # Instantiate GCS client
    client = storage.Client()
    # Instantiate bucket definition
    bucket = client.bucket("bucket-name")
    # Object names to look for (placeholder values)
    filenames = ["name_modifier_example", "name_modifier_2_example"]
    # Search for each object and dispatch the matching Dataflow job
    for file in filenames:
        exists = bucket.blob(file).exists()
        if exists and "name_modifier" in file:
            pass  # Run name_modifier Dataflow job
        elif exists and "name_modifier_2" in file:
            pass  # Run name_modifier_2 Dataflow job
        else:
            return "File not found"
This code isn't exactly what you want from a logic standpoint, but it should get you started. You'll probably want to first make sure all of the objects can be found, and only then move to another step where you run the corresponding Dataflow job for each file found in the previous step.
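To make that concrete, here is a rough sketch of my own that combines the question's launch code with the client library above; the bucket name, prefixes, and template paths are placeholders. On every upload it checks that at least one object exists under inbound/ for each required prefix, and only then launches the corresponding Dataflow jobs:
from google.cloud import storage
from googleapiclient.discovery import build

REQUIRED_PREFIXES = ['Customer_', 'Customer_Address_', 'Customer_service_ticket_']

def df_load_function(file, context):
    client = storage.Client()

    # Find the most recent object for each required prefix under inbound/
    found = {}
    for prefix in REQUIRED_PREFIXES:
        blobs = list(client.list_blobs('xxx', prefix='inbound/' + prefix))
        if blobs:
            found[prefix] = blobs[-1].name  # adjust selection to your date-suffix naming

    # If any member of the set is missing, exit and wait for the next upload event
    if len(found) < len(REQUIRED_PREFIXES):
        print('File set incomplete, waiting for remaining files')
        return

    # All files are present: launch one Dataflow job per prefix
    dataflow = build('dataflow', 'v1b3', cache_discovery=False)
    for prefix, name in found.items():
        request = dataflow.projects().locations().templates().launch(
            projectId='xxx',
            gcsPath='gs://xxx/template/df_load_wave1_{}'.format(prefix),
            location='asia-south1',
            body={
                'jobName': 'df_load_wave1_{}'.format(prefix),
                'environment': {
                    'workerRegion': 'asia-south1',
                    'tempLocation': 'gs://xxx/temp'
                }
            }
        )
        print('Launched job for {}: {}'.format(name, request.execute()['job']['id']))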

Group together celery results

TL;DR
I want to label results in the backend.
I have a flask/celery project and I'm new to celery.
A user sends in a batch of tasks for celery to work on.
Celery saves the results to a backend SQL database (table automatically created by Celery, named celery_taskmeta).
I want to let the user see the status of his batch, and request the results from the backend.
My problem is that all the results are in one table. What are my options to label this batch, so the user can differentiate the batches?
My ideas:
Can I add a label to each task, e.g. "Bob's batch no. 12", and then query celery_taskmeta for that?
Can I put each batch in a named backend table, i.e. ask Celery to save results to a table named task_12?
Trying with groups
I've tried the following code to group the results
job_group = group(api_get.delay(url) for url in urllist)
But I don't see any way to identify the group in the backend/results DB
Trying with task name
In the backend I see an empty column header 'name' so I thought I could add an arbitrary string there:
@app.task(name="an amazing vegetable")
def api_get(url: str) -> tuple:
    ...
But then the celery worker throws an error when I run the task:
KeyError: 'an amazing vegetable'
[2020-12-18 12:07:22,713: ERROR/MainProcess] Received unregistered task of type 'an amazing vegetable'.
Probably the simplest solution is to use a group and use its GroupResult to periodically poll for the group state.
A1: As for the label question - yes, you can "label" your task by using the custom state feature.
A2: You can hack around to put each batch of tasks inside a separate backend table, but I strongly advise against messing with it. If you really want to go this route, use a separate database for this particular purpose.
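For the group route, a minimal sketch (assuming your existing Flask/Celery app, the api_get task, and a result backend that supports saving group metadata) would be to build the batch from task signatures, save the GroupResult, and hand its id back to the user as the batch label:
from celery import group
from celery.result import GroupResult

def submit_batch(urllist):
    # Use signatures (.s) rather than .delay inside the group
    job = group(api_get.s(url) for url in urllist).apply_async()
    job.save()       # persist the group in the result backend
    return job.id    # return this id to the user as "their batch"

def batch_status(group_id, celery_app):
    result = GroupResult.restore(group_id, app=celery_app)
    return {
        'total': len(result.results),
        'completed': result.completed_count(),
        'ready': result.ready(),
    }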

Azure Logic Apps pricing

I have the following logic app:
Briefly, When a file is created on OneDrive (the trigger):
If the file content-type is not 'application/pdf', then the app terminates.
If the file content-type is 'application/pdf' it sends an email & then deletes the file from OneDrive.
The above logic app has 1 trigger & 4 actions.
From MS we have the following pricing:
PRICE PER EXECUTION
Actions €0.000022
Standard Connector €0.000106
Enterprise Connector €0.000844
From my understanding and if I am not mistaken, the plan has 2 Standard Actions (Send email & Delete file) and 2 Built-in actions (the Condition & the Terminate one).
My questions are the following:
If the file on OneDrive is not a PDF (and as such only the Condition and then the Terminate actions run), will MS charge only the 2 built-in actions (2 * €0.000022), or the 1 built-in (Condition) + the 2 standard ones (€0.000022 + 2 * €0.000106), or directly all the actions of the plan, i.e. 2 built-in + 2 standard (2 * €0.000022 + 2 * €0.000106)?
Aren't the triggers being charged at all?
Is there any charge when the plan execution is skipped (as below) because no file was created on OneDrive? (Keep in mind that the trigger checks for items [new files] every 1 minute.)
If the file on the OneDrive is not PDF
In my understanding, your logic app will execute 'When a file is created' (€0.000022) + 'Condition' (€0.000022) + 'Terminate' (€0.000022).
Aren't the triggers being charged at all?
In my understanding, the trigger is regarded as a special action, so each execution will cost €0.000022.
Is it occurring any charge when the plan execution is skipped (like bellow) in case no file is created on OneDrive?
You can click into the run to check each operation and see whether the trigger actually fired: check if the right side of the action shows a tick. If the trigger executed (the right side is ticked), you will be charged for each execution of the trigger.
By the way, for this kind of billing question you'd better seek official technical support (free of charge); their answer will be more authoritative.
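To put those per-execution prices in perspective, here is a rough back-of-the-envelope estimate (my own arithmetic, assuming every one-minute trigger poll is billed at the €0.000022 rate, which the answer above suggests but does not guarantee):
# Rough monthly cost estimate under the stated assumptions
ACTION_PRICE = 0.000022    # EUR per built-in action / trigger execution
STANDARD_PRICE = 0.000106  # EUR per Standard connector action

polls_per_month = 60 * 24 * 30                 # one trigger check per minute
trigger_cost = polls_per_month * ACTION_PRICE  # ~0.95 EUR per month

# One full PDF run: condition + send email + delete file (on top of the trigger)
pdf_run_cost = 1 * ACTION_PRICE + 2 * STANDARD_PRICE

# One non-PDF run: condition + terminate
non_pdf_run_cost = 2 * ACTION_PRICE

print(f"Monthly trigger polling: ~EUR {trigger_cost:.2f}")
print(f"Per PDF run (excl. trigger): EUR {pdf_run_cost:.6f}")
print(f"Per non-PDF run (excl. trigger): EUR {non_pdf_run_cost:.6f}")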
