I am setting up an automatic data labelling pipeline for my colleague.
First, I define the Ground Truth request via the API (bucket, manifests, etc.).
Second, I create the labelling job, and all files are uploaded to S3 immediately.
After that, my colleague receives an email saying the task is ready to label; he then labels the data and submits it.
Up to this point everything is quick and works well. But when I check the SageMaker labelling job dashboard, it shows the task as in progress, and it takes a very long time before it is marked completed or failed, and I don't know why. Yesterday it saved the results at 4 am, which took around 6 hours. However, if I create the labelling job through the console instead of sending API requests, it saves the results quickly.
Can anyone explain this? Or do I perhaps need to set up a time sync or some other configuration?
This is my config:
{
    "InputConfig": {
        "DataSource": {
            "S3DataSource": {
                "ManifestS3Uri": "s3://{bucket_name}/{JOB_ID}/{manifest_name}-{JOB_ID}.manifest"
            }
        },
        "DataAttributes": {
            "ContentClassifiers": [
                "FreeOfPersonallyIdentifiableInformation",
                "FreeOfAdultContent"
            ]
        }
    },
    "OutputConfig": {
        "S3OutputPath": "s3://{bucket_name}/{JOB_ID}/output-{manifest_name}/"
    },
    "HumanTaskConfig": {
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-2:266458841044:function:PRE-TextMultiClass"
        },
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-2:266458841044:function:PRE-TextMultiClass",
        "NumberOfHumanWorkersPerDataObject": 2,
        "TaskDescription": "Dear Annotator, please label it according to instructions. Thank you!",
        "TaskKeywords": [
            "text",
            "label"
        ],
        "TaskTimeLimitInSeconds": 600,
        "TaskTitle": "Label Text",
        "UiConfig": {
            "UiTemplateS3Uri": "s3://{bucket_name}/instructions.template"
        },
        "WorkteamArn": "work team arn"
    },
    "LabelingJobName": "Label",
    "RoleArn": "my role arn",
    "LabelAttributeName": "category",
    "LabelCategoryConfigS3Uri": "s3://{bucket_name}/labels.json"
}
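For reference, this is roughly how the job gets created and monitored from code, simplified here as a sketch (assuming the config above is loaded into a dict named config and the placeholders are filled in; this is not my exact pipeline code):

import time
import boto3

sagemaker = boto3.client("sagemaker")

# Submit the labelling job using the fields from the config above.
sagemaker.create_labeling_job(
    LabelingJobName=config["LabelingJobName"],
    LabelAttributeName=config["LabelAttributeName"],
    InputConfig=config["InputConfig"],
    OutputConfig=config["OutputConfig"],
    RoleArn=config["RoleArn"],
    LabelCategoryConfigS3Uri=config["LabelCategoryConfigS3Uri"],
    HumanTaskConfig=config["HumanTaskConfig"],
)

# Poll the status instead of watching the dashboard.
while True:
    status = sagemaker.describe_labeling_job(
        LabelingJobName=config["LabelingJobName"]
    )["LabelingJobStatus"]
    print(status)
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)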
I suspect my Lambda functions are the problem: when I switch to the AWS-provided ARNs (pre-human and annotation consolidation), everything works fine.
This is my after-labelling (annotation consolidation) Lambda:
import json
import boto3
from urllib.parse import urlparse


def lambda_handler(event, context):
    consolidated_labels = []

    # Fetch the consolidation request payload from S3.
    parsed_url = urlparse(event['payload']['s3Uri'])
    s3 = boto3.client('s3')
    textFile = s3.get_object(Bucket=parsed_url.netloc, Key=parsed_url.path[1:])
    filecont = textFile['Body'].read()
    annotations = json.loads(filecont)

    # Build one consolidated label per dataset object.
    for dataset in annotations:
        for annotation in dataset['annotations']:
            new_annotation = json.loads(annotation['annotationData']['content'])
            label = {
                'datasetObjectId': dataset['datasetObjectId'],
                'consolidatedAnnotation': {
                    'content': {
                        event['labelAttributeName']: {
                            'workerId': annotation['workerId'],
                            'result': new_annotation,
                            'labeledContent': dataset['dataObject']
                        }
                    }
                }
            }
            consolidated_labels.append(label)

    return consolidated_labels
Are there any likely reasons for this?
I'm new to Python (using 3.7, no particular reason) and setting up some automated reporting for my company. I've tried several different ways of using the response from the Marketing API. This is my starting point:
def account_name():
    # Don't worry, I have all of my credentials as global variables, as I'm setting this up on several accounts.
    daily_budget = 200
    ad_account_id = 'act_xxxxxxxxxxx'

    response = (AdAccount(ad_account_id).get_insights(
        fields=fields,
        params=params,
    ))
    spend = (AdAccount(ad_account_id).get_insights(
        fields=fields2
    ))
    print(response)
The response I get is:
[<AdsInsights> {
    "action_values": [
        {
            "action_type": "omni_add_to_cart",
            "value": "10489.9"
        },
        {
            "action_type": "omni_purchase",
            "value": "8283.81"
        }
    ],
    "actions": [
        {
            "action_type": "omni_add_to_cart",
            "value": "1416"
        },
        {
            "action_type": "omni_purchase",
            "value": "288"
        }
    ],
    "clicks": "1907",
    "cpc": "0.477787",
    "cpm": "2.984927",
    "ctr": "0.62474",
    "date_start": "2020-12-14",
    "date_stop": "2020-12-14",
    "impressions": "305247",
    "purchase_roas": [
        {
            "action_type": "omni_purchase",
            "value": "9.091698"
        }
    ],
    "reach": "242920",
    "spend": "911.14"
}]
The problem I'm having is that I need to pull the value for "spend" out of the response and use it in a conditional that triggers a webhook if the amount is too high. Everything I've tried either doesn't work with the data type of the response or gives me the error "'Cursor' object is not callable". An example of what I'm looking for is something to the effect of:
if 'spend' > daily_budget:
    print("too high")
And yes, I know the above snippet is not formatted and will not work, but that's the idea I'm looking for; I just don't know how to make it happen. I'm incredibly new to Python, so I'm really stuck here. Thanks for any help!
Figured it out:
metrics = (AdAccount(ad_account_id).get_insights(
    fields=fields,
    params=params,
))

for i in metrics:
    metrics_data = dict(i)
    data = {
        "Date:": metrics_data.get("date_start"),
        "Spend:": metrics_data.get("spend"),
        "Reach:": metrics_data.get("reach"),
        "Impressions:": metrics_data.get("impressions"),
        "Clicks:": metrics_data.get("clicks"),
        "CPC:": metrics_data.get("cpc"),
        "CTR:": metrics_data.get("ctr"),
        "CPM:": metrics_data.get("cpm"),
        "ROAS:": metrics_data.get("purchase_roas")
    }
Then I can call the "spend" field specifically.
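As a follow-up sketch of the budget check from the original question (assuming daily_budget is defined as in the first snippet; note the API returns "spend" as a string, so it needs converting to a float before comparing):

spend = float(metrics_data.get("spend", 0))

if spend > daily_budget:
    print("too high")
    # trigger the webhook here, e.g. a POST to your webhook URL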
I am trying to execute this Elasticsearch query via Spark:
POST /aa6/_mtermvectors
{
  "ids": [
    "ABC",
    "XYA",
    "RTE"
  ],
  "parameters": {
    "fields": [
      "attribute"
    ],
    "term_statistics": true,
    "offsets": false,
    "payloads": false,
    "positions": false
  }
}
The code that I have written in Zeppelin is:
def createString(): String = {
  return s"""_mtermvectors {
    "ids": [
      "ABC",
      "XYA",
      "RTE"
    ],
    "parameters": {
      "fields": [
        "attribute"
      ],
      "term_statistics": true,
      "offsets": false,
      "payloads": false,
      "positions": false
    }
  }"""
}
import org.elasticsearch.spark._
sc.esRDD("aa6", "?q="+createString).count
I get the error:
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: org.elasticsearch.hadoop.rest.EsHadoopRemoteException: parse_exception: parse_exception: Encountered " <RANGE_GOOP> "["RTE","XYA","ABC" "" at line 1, column 22.
Was expecting:
"TO" ...
{"query":{"query_string":{"query":"_mtermvectors {\"ids\": [\"RTE\",\"ABC\",\"XYA\"], \"parameters\": {\"fields\": [\"attribute\"], \"term_statistics\": true, \"offsets\": false, \"payloads\": false, \"positions\": false } }"}}}
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:477)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:434)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:428)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:408)
This is probably something simple, but I am unable to find a way to set the request body when making the Spark call.
I'm not sure, but I don't think this is currently supported by the es-Spark package. You can check this link to see what options are available via the esRDD method on sparkContext.
What you could do instead is use Elasticsearch's High Level REST Client, collect the details into a List, Seq, or a file, and then load that into a Spark RDD.
It is a roundabout way, but unfortunately I suppose it is the only way. To help, I have created the snippet below so you at least have the data from Elasticsearch that the above query would return.
import org.apache.http.HttpHost
import org.elasticsearch.client.RequestOptions
import org.elasticsearch.client.RestClient
import org.elasticsearch.client.RestHighLevelClient
import org.elasticsearch.client.core.MultiTermVectorsRequest
import org.elasticsearch.client.core.TermVectorsRequest
import org.elasticsearch.client.core.TermVectorsResponse

object SampleSparkES {

  /**
   * Main class where the program starts
   */
  def main(args: Array[String]) = {
    val termVectorsResponse = elasticRestClient
    println(termVectorsResponse.size)
  }

  /**
   * Scala client code to retrieve the response of mtermVectors
   */
  def elasticRestClient: java.util.List[TermVectorsResponse] = {
    val client = new RestHighLevelClient(
      RestClient.builder(
        new HttpHost("localhost", 9200, "http")))

    val tvRequestTemplate = new TermVectorsRequest("aa6", "ids")
    tvRequestTemplate.setFields("attribute")

    // Set the document ids you want for collecting the term vector information
    val ids = Array("1", "2", "3")
    val request = new MultiTermVectorsRequest(ids, tvRequestTemplate)
    val response = client.mtermvectors(request, RequestOptions.DEFAULT)

    // Get the response
    val termVectorsResponse = response.getTermVectorsResponses

    // Close the RestHighLevelClient
    client.close()

    // Return List[TermVectorsResponse]
    termVectorsResponse
  }
}
As an example, you can get the sumDocFreq of the first document in the following manner:
println(termVectorsResponse.iterator.next.getTermVectorsList.iterator.next.getFieldStatistics.getSumDocFreq)
All you would now need is a way to convert the collection into a Seq that can be loaded into an RDD.
I want to create dynamic scheduled queries using Python, and I want to publish a message to Pub/Sub when the query completes. I know I can do that from the UI, but that's not what I'm looking for.
Currently I'm doing this, but the "notificationPubsubTopic" field gets ignored in the request:
import googleapiclient.http
from googleapiclient import discovery, errors

resource = discovery.build("bigquerydatatransfer", "v1")

body = {
    "notificationPubsubTopic": "projects/{my_project}/topics/{my_topic}",
    "scheduleOptions": {
        "disableAutoScheduling": False
    },
    "disabled": False,
    "displayName": "my_table_name",
    "dataSourceId": "scheduled_query",
    "destinationDatasetId": "test",
    "emailPreferences": {
        "enableFailureEmail": False
    },
    "params": {
        "query": "select 1",
        "write_disposition": "WRITE_TRUNCATE",
        "destination_table_name_template": "table_name_test"
    },
    "schedule": "every day 09:35"
}

creation_job = resource.projects().transferConfigs().create(parent=project, body=body)
creation_job.execute()
A few days ago Google released a new version of the data transfer library that adds support for notification_pubsub_topic when creating transfer configs.
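For anyone looking for a rough sketch of that approach with the updated google-cloud-bigquery-datatransfer client (the parent location "us" is an assumption, and the project/topic placeholders mirror the ones in the question; check the library docs for the exact release that added the field):

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = "projects/{my_project}/locations/us"  # location "us" is an assumption

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="test",
    display_name="my_table_name",
    data_source_id="scheduled_query",
    params={
        "query": "select 1",
        "write_disposition": "WRITE_TRUNCATE",
        "destination_table_name_template": "table_name_test",
    },
    schedule="every day 09:35",
    notification_pubsub_topic="projects/{my_project}/topics/{my_topic}",
)

created = client.create_transfer_config(
    parent=parent,
    transfer_config=transfer_config,
)
print(created.name)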
I'm trying to create a Kubernetes cluster on Google Cloud Platform through Python (3.7) using the google-cloud-container module.
I created a Kubernetes cluster through Google Cloud Platform and was able to successfully retrieve details for that cluster using the google-cloud-container Python module.
I'm now trying to create a Kubernetes cluster through this module. I created a JSON file with the required key values and passed it as a parameter, but I'm getting errors. I would appreciate some sample code for creating a Kubernetes cluster on Google Cloud Platform. Thank you in advance.
from google.oauth2 import service_account
from google.cloud import container_v1


class GoogleCloudKubernetesClient(object):

    def __init__(self, file, project_id, project_name, zone, cluster_id):
        credentials = service_account.Credentials.from_service_account_file(
            filename=file)
        self.client = container_v1.ClusterManagerClient(credentials=credentials)
        self.project_id = project_id
        self.zone = zone

    def create_cluster(self, cluster):
        print(cluster)
        response = self.client.create_cluster(self.project_id, self.zone, cluster=cluster)
        print(f"response for cluster creation: {response}")


def main():
    cluster_data = {
        "name": "test_cluster",
        "masterAuth": {
            "username": "admin",
            "clientCertificateConfig": {
                "issueClientCertificate": True
            }
        },
        "loggingService": "logging.googleapis.com",
        "monitoringService": "monitoring.googleapis.com",
        "network": "projects/abhinav-215/global/networks/default",
        "addonsConfig": {
            "httpLoadBalancing": {},
            "horizontalPodAutoscaling": {},
            "kubernetesDashboard": {
                "disabled": True
            },
            "istioConfig": {
                "disabled": True
            }
        },
        "subnetwork": "projects/abhinav-215/regions/us-west1/subnetworks/default",
        "nodePools": [
            {
                "name": "test-pool",
                "config": {
                    "machineType": "n1-standard-1",
                    "diskSizeGb": 100,
                    "oauthScopes": [
                        "https://www.googleapis.com/auth/cloud-platform"
                    ],
                    "imageType": "COS",
                    "labels": {
                        "App": "web"
                    },
                    "serviceAccount": "abhinav@abhinav-215.iam.gserviceaccount.com",
                    "diskType": "pd-standard"
                },
                "initialNodeCount": 3,
                "autoscaling": {},
                "management": {
                    "autoUpgrade": True,
                    "autoRepair": True
                },
                "version": "1.11.8-gke.6"
            }
        ],
        "locations": [
            "us-west1-a",
            "us-west1-b",
            "us-west1-c"
        ],
        "resourceLabels": {
            "stage": "dev"
        },
        "networkPolicy": {},
        "ipAllocationPolicy": {},
        "masterAuthorizedNetworksConfig": {},
        "maintenancePolicy": {
            "window": {
                "dailyMaintenanceWindow": {
                    "startTime": "02:00"
                }
            }
        },
        "privateClusterConfig": {},
        "databaseEncryption": {
            "state": "DECRYPTED"
        },
        "initialClusterVersion": "1.11.8-gke.6",
        "location": "us-west1-a"
    }
    kube = GoogleCloudKubernetesClient(file='/opt/key.json', project_id='abhinav-215', zone='us-west1-a')
    kube.create_cluster(cluster_data)


if __name__ == '__main__':
    main()
Actual Output:
Traceback (most recent call last):
  File "/opt/matilda_linux/matilda_linux_logtest/matilda_discovery/matilda_discovery/test/google_auth.py", line 118, in <module>
    main()
  File "/opt/matilda_linux/matilda_linux_logtest/matilda_discovery/matilda_discovery/test/google_auth.py", line 113, in main
    kube.create_cluster(cluster_data)
  File "/opt/matilda_linux/matilda_linux_logtest/matilda_discovery/matilda_discovery/test/google_auth.py", line 31, in create_cluster
    response = self.client.create_cluster(self.project_id, self.zone, cluster=cluster)
  File "/opt/matilda_discovery/venv/lib/python3.6/site-packages/google/cloud/container_v1/gapic/cluster_manager_client.py", line 407, in create_cluster
    project_id=project_id, zone=zone, cluster=cluster, parent=parent
ValueError: Protocol message Cluster has no "masterAuth" field.
A somewhat late answer, but I had the same problem and figured it out. It is worth writing down for future viewers.
You should not write the field names in cluster_data as they appear in the REST API.
Instead, you should translate them to Python convention, with words separated by underscores instead of camelCase.
Thus, instead of writing masterAuth, you should write master_auth. You should make similar changes to the rest of your fields, and then the script should work.
P.S. You aren't using the project_name and cluster_id params in GoogleCloudKubernetesClient.__init__. I'm not sure what they are for, but you should probably remove them.
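For illustration, a partial sketch of how the top of cluster_data would look after that renaming (same values as in the question, just with snake_case keys; the remaining fields follow the same pattern and are not repeated here):

cluster_data = {
    "name": "test_cluster",
    "master_auth": {
        "username": "admin",
        "client_certificate_config": {
            "issue_client_certificate": True
        }
    },
    "logging_service": "logging.googleapis.com",
    "monitoring_service": "monitoring.googleapis.com",
    "network": "projects/abhinav-215/global/networks/default",
    "initial_cluster_version": "1.11.8-gke.6",
    # ...and so on: node_pools, initial_node_count, resource_labels,
    # ip_allocation_policy, maintenance_policy, database_encryption, etc.
}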
The module is still using the basic REST API format to create the cluster. You can also use the GUI to choose all the options you want for your cluster, then press the REST hyperlink at the bottom of the page; this will provide you with the REST format required to build the cluster you want.
The error you are getting is because you have a blank (or unspecified) field that must be specified. Some of the fields listed in the API have default values that you don't need to set; others are required.
I have a Lambda function in Node.js that processes new images added to my bucket. I want to run the function for all existing objects. How can I do this? I figured the easiest way is to "re-put" each object, to trigger the function, but I'm not sure how to do this.
To be clear - I want to run, one-time, on each of the existing objects. The trigger is already working for new objects, I just need to run it on the objects that were inserted before the lambda function was created.
The following Lambda function will do what you require.
It will iterate through each file in your target S3 bucket and, for each one, execute the desired Lambda function against it, emulating a put operation.
You're probably going to want to set a very long execution time allowance for this function.
var TARGET_BUCKET = "my-bucket-goes-here";
var TARGET_LAMBDA_FUNCTION_NAME = "TestFunct";
var S3_PUT_SIMULATION_PARAMS = {
    "Records": [
        {
            "eventVersion": "2.0",
            "eventTime": "1970-01-01T00:00:00.000Z",
            "requestParameters": {
                "sourceIPAddress": "127.0.0.1"
            },
            "s3": {
                "configurationId": "testConfigRule",
                "object": {
                    "eTag": "0123456789abcdef0123456789abcdef",
                    "sequencer": "0A1B2C3D4E5F678901",
                    "key": "HappyFace.jpg",
                    "size": 1024
                },
                "bucket": {
                    "arn": "arn:aws:s3:::mybucket",
                    "name": "sourcebucket",
                    "ownerIdentity": {
                        "principalId": "EXAMPLE"
                    }
                },
                "s3SchemaVersion": "1.0"
            },
            "responseElements": {
                "x-amz-id-2": "EXAMPLE123/5678abcdefghijklambdaisawesome/mnopqrstuvwxyzABCDEFGH",
                "x-amz-request-id": "EXAMPLE123456789"
            },
            "awsRegion": "us-east-1",
            "eventName": "ObjectCreated:Put",
            "userIdentity": {
                "principalId": "EXAMPLE"
            },
            "eventSource": "aws:s3"
        }
    ]
};

var aws = require('aws-sdk');
var s3 = new aws.S3();
var lambda = new aws.Lambda();

exports.handler = (event, context, callback) => {
    retrieveS3BucketContents(TARGET_BUCKET, function(s3Objects){
        simulateS3PutOperation(TARGET_BUCKET, s3Objects, simulateS3PutOperation, function(){
            console.log("complete.");
        });
    });
};

function retrieveS3BucketContents(bucket, callback){
    s3.listObjectsV2({
        Bucket: TARGET_BUCKET
    }, function(err, data) {
        callback(data.Contents);
    });
}

function simulateS3PutOperation(bucket, s3ObjectStack, callback, callbackEmpty){
    var params = {
        FunctionName: TARGET_LAMBDA_FUNCTION_NAME,
        Payload: ""
    };

    if(s3ObjectStack.length > 0){
        var s3Obj = s3ObjectStack.pop();
        var p = S3_PUT_SIMULATION_PARAMS;
        p.Records[0].s3.bucket.name = bucket;
        p.Records[0].s3.object.key = s3Obj.Key;
        params.Payload = JSON.stringify(p, null, 2);
        lambda.invoke(params, function(err, data) {
            if (err) console.log(err, err.stack); // an error occurred
            else {
                callback(bucket, s3ObjectStack, callback, callbackEmpty);
            }
        });
    }
    else {
        callbackEmpty();
    }
}
Below is the full policy that your Lambda function will need to execute this method; it allows read/write access to CloudWatch Logs and ListBucket access to S3. You need to fill in your bucket details where you see MY-BUCKET-GOES-HERE.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1477382207000",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::MY-BUCKET-GOES-HERE/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        }
    ]
}
This thread helped push me in the right direction, as I needed to invoke a Lambda function per file for 50k existing files across two buckets. I decided to write it in Python and to limit the number of Lambda functions running simultaneously to 500 (the concurrency limit for many AWS regions is 1000).
The script creates a worker pool of 500 threads that feed off a queue of bucket keys. Each worker waits for its Lambda to finish before picking up another. Since running this script against my 50k files takes a couple of hours, I'm just running it from my local machine. Hope this helps someone!
#!/usr/bin/env python

# Proper imports
import json
import time
import base64
from queue import Queue
from threading import Thread

from argh import dispatch_command

import boto3
from boto.s3.connection import S3Connection

client = boto3.client('lambda')


def invoke_lambdas():
    try:
        # replace these with your access keys
        s3 = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
        buckets = [s3.get_bucket('bucket-one'), s3.get_bucket('bucket-two')]

        queue = Queue()
        num_threads = 500

        # create a worker pool
        for i in range(num_threads):
            worker = Thread(target=invoke, args=(queue,))
            worker.setDaemon(True)
            worker.start()

        for bucket in buckets:
            for key in bucket.list():
                queue.put((bucket.name, key.key))

        queue.join()

    except Exception as e:
        print(e)


def invoke(queue):
    while True:
        bucket_key = queue.get()

        try:
            print('Invoking lambda with bucket %s key %s. Remaining to process: %d'
                  % (bucket_key[0], bucket_key[1], queue.qsize()))
            trigger_event = {
                'Records': [{
                    's3': {
                        'bucket': {
                            'name': bucket_key[0]
                        },
                        'object': {
                            'key': bucket_key[1]
                        }
                    }
                }]
            }

            # replace lambda_function_name with the actual name
            # InvocationType='RequestResponse' means it will wait until the lambda fn is complete
            response = client.invoke(
                FunctionName='lambda_function_name',
                InvocationType='RequestResponse',
                LogType='None',
                ClientContext=base64.b64encode(json.dumps({}).encode()).decode(),
                Payload=json.dumps(trigger_event).encode()
            )
            if response['StatusCode'] != 200:
                print(response)
        except Exception as e:
            print(e)
            print('Exception during invoke_lambda')

        queue.task_done()


if __name__ == '__main__':
    dispatch_command(invoke_lambdas)
As I had to do this on a very large bucket, and Lambda functions have a maximum execution time of 10 minutes, I ended up writing a script with the Ruby AWS SDK.
require 'aws-sdk-v1'

class LambdaS3Invoker

  BUCKET_NAME = "HERE_YOUR_BUCKET"
  FUNCTION_NAME = "HERE_YOUR_FUNCTION_NAME"
  AWS_KEY = "HERE_YOUR_AWS_KEY"
  AWS_SECRET = "HERE_YOUR_AWS_SECRET"
  REGION = "HERE_YOUR_REGION"

  def execute
    bucket.objects({ prefix: 'products' }).each do |o|
      lambda_invoke(o.key)
    end
  end

  private

  def lambda_invoke(key)
    lambda.invoke({
      function_name: FUNCTION_NAME,
      invocation_type: 'Event',
      payload: JSON.generate({
        Records: [{
          s3: {
            object: {
              key: key,
            },
            bucket: {
              name: BUCKET_NAME,
            }
          }
        }]
      })
    })
  end

  def lambda
    @lambda ||= Aws::Lambda::Client.new(
      region: REGION,
      access_key_id: AWS_KEY,
      secret_access_key: AWS_SECRET
    )
  end

  def resource
    @resource ||= Aws::S3::Resource.new(
      access_key_id: AWS_KEY,
      secret_access_key: AWS_SECRET
    )
  end

  def bucket
    @bucket ||= resource.bucket(BUCKET_NAME)
  end
end
And then you can call it like:
LambdaS3Invoker.new.execute
What you need to do is create a one-time script which uses the AWS SDK to invoke your Lambda function. This solution doesn't require you to "re-put" the object.
I am going to base my answer on the AWS JS SDK.
To be clear - I want to run, one-time, on each of the existing
objects. The trigger is already working for new objects, I just need
to run it on the objects that were inserted before the lambda function
was created.
As you have a working Lambda function which accepts S3 put events, what you need to do is find all the unprocessed objects in S3. (If you have DB entries for each S3 object this should be easy; if not, you might find the S3 listObjectsV2 function handy: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#listObjectsV2-property.)
Then, for each unprocessed S3 object, create a JSON object which looks like an S3 put event message (shown below) and call the Lambda invoke function with that JSON object as the payload.
You can find the Lambda invoke function docs at http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Lambda.html#invoke-property
When creating the fake S3 put event message object for your Lambda function, you can ignore most of the actual object properties, depending on what your Lambda function uses. I guess the least you will have to set is the bucket name and object key.
S3 Put Event Message Structure http://docs.aws.amazon.com/AmazonS3/latest/dev/notification-content-structure.html
{
   "Records":[
      {
         "eventVersion":"2.0",
         "eventSource":"aws:s3",
         "awsRegion":"us-east-1",
         "eventTime":"1970-01-01T00:00:00.000Z",
         "eventName":"ObjectCreated:Put",
         "userIdentity":{
            "principalId":"AIDAJDPLRKLG7UEXAMPLE"
         },
         "requestParameters":{
            "sourceIPAddress":"127.0.0.1"
         },
         "responseElements":{
            "x-amz-request-id":"C3D13FE58DE4C810",
            "x-amz-id-2":"FMyUVURIY8/IgAtTv8xRjskZQpcIZ9KG4V5Wp6S7S/JRWeUWerMUE5JgHvANOjpD"
         },
         "s3":{
            "s3SchemaVersion":"1.0",
            "configurationId":"testConfigRule",
            "bucket":{
               "name":"mybucket",
               "ownerIdentity":{
                  "principalId":"A3NL1KOZZKExample"
               },
               "arn":"arn:aws:s3:::mybucket"
            },
            "object":{
               "key":"HappyFace.jpg",
               "size":1024,
               "eTag":"d41d8cd98f00b204e9800998ecf8427e",
               "versionId":"096fKKXTRTtl3on89fVO.nfljtsv6qko",
               "sequencer":"0055AED6DCD90281E5"
            }
         }
      }
   ]
}
Well, basically what you need is to use some API calls (boto, for example, if you use Python), list all new objects or all objects in your S3 bucket, and then process those objects.
Here is a snippet:
from boto.s3.connection import S3Connection

conn = S3Connection()
source = conn.get_bucket(src_bucket)
src_list = set([key.name for key in source.get_all_keys(headers=None, prefix=prefix)])

# and then you can go over this src list
for entry in src_list:
    # do something with each key, e.g. invoke your Lambda function
    pass
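If a newer SDK is preferred, here is a boto3-only sketch of the same idea (the bucket and function names below are placeholders; it lists every key and sends the usual S3 put-style payload to the function, much like the threaded script earlier but without the worker pool):

import json
import boto3

s3 = boto3.client('s3')
lam = boto3.client('lambda')

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket'):  # placeholder bucket name
    for obj in page.get('Contents', []):
        event = {'Records': [{'s3': {
            'bucket': {'name': 'my-bucket'},
            'object': {'key': obj['Key']},
        }}]}
        # 'Event' fires asynchronously; use 'RequestResponse' to wait for each invocation
        lam.invoke(
            FunctionName='my-existing-function',  # placeholder function name
            InvocationType='Event',
            Payload=json.dumps(event).encode(),
        )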