How to get results of AWS Glue Job when executing via API? - python-3.x

I executed an AWS Glue Job via API Gateway to start the job run. The job run is successful. But the result of the Script (print of a result) has not gotten through the execution. Only job run ID comes as the response. Is there any way to get the result of the job through an API?

For glue anything you print or log goes into cloud watch
You have an option of adding a handler in your logger that writes to a stream and push that stream to a file in s3. Or better yet, create a StringIO object , store your result to it and then send that to s3

Related

Troubleshoot DynamoDB to Elastic Search

Let's suppose I have a database on DynamoDB, and I am currently using streams and lambda functions to send that data to Elasticsearch.
Here's the thing, supposing the data is saved successfully on DynamoDB, is there a way for me to be 100% sure that the data has been saved on Elasticsearch as well?
Considering I have a function to save that data on DDB is there a way for me communicate with the lambda function triggered by DDB before returning a status code answer, so I can receive confirmation before returning?
I want to do that in order to return ok both from my function and the lambda function at the same time.
This doesn't look like the correct approach for this problem. We generally use DynamoDB Streams + Lambda for operations that are async in nature and when we don't have to communicate the status of this Lambda execution to the client.
So I suggest the following two approaches that are the closest to what you are trying to achieve -
Make the operation completely synchronous. i.e., do the DynamoDB insert and ElasticSearch insert in the same call (without any Ddb Stream and Lambda triggers). This will ensure that you return the correct status of both writes to the client. Also, in case the ES insert fails, you have an option to revert the Ddb write and then return the complete status as failed.
The first approach obviously adds to the Latency of the function. So you can continue with your original approach, but let the client know about it. It will work as follows -
Client calls your API.
API inserts record into Ddb and returns to the client.
The client receives the status and displays a message to the user that their request is being processed.
The client then starts polling for the status of the ES insert via another API.
Meanwhile, the Ddb stream triggers the ES insert Lambda fn and completes the ES write.
The poller on the client comes to know about the successful insert into ES and displays a final success message to the user.

How to start an ec2 instance using sqs and trigger a python script inside the instance

I have a python script which takes video and converts it to a series of small panoramas. Now, theres an S3 bucket where a video will be uploaded (mp4). I need this file to be sent to the ec2 instance whenever it is uploaded.
This is the flow:
Upload video file to S3.
This should trigger EC2 instance to start.
Once it is running, I want the file to be copied to a particular directory inside the instance.
After this, I want the py file (panorama.py) to start running and read the video file from the directory and process it and then generate output images.
These output images need to be uploaded to a new bucket or the same bucket which was initially used.
Instance should terminate after this.
What I have done so far is, I have created a lambda function that is triggered whenever an object is added to that bucket. It stores the name of the file and the path. I had read that I now need to use an SQS queue and pass this name and path metadata to the queue and use the SQS to trigger the instance. And then, I need to run a script in the instance which pulls the metadata from the SQS queue and then use that to copy the file(mp4) from bucket to the instance.
How do i do this?
I am new to AWS and hence do not know much about SQS or how to transfer metadata and automatically trigger instance, etc.
Your wording is a bit confusing. It says that you want to "start" an instance (which suggests that the instance already exists), but then it says that it wants to "terminate" an instance (which would permanently remove it). I am going to assume that you actually intend to "stop" the instance so that it can be used again.
You can put a shell script in the /var/lib/cloud/scripts/per-boot/ directory. This script will then be executed every time the instance starts.
When the instance has finished processing, it can call sudo shutdown now -h to turn off the instance. (Alternatively, it can tell EC2 to stop the instance, but using shutdown is easier.)
For details, see: Auto-Stop EC2 instances when they finish a task - DEV Community
I tried to answer in the most minimalist way, there are many points below that can be further improved. I think below is still quite some as you mentioned you are new to AWS.
Using AWS Lambda with Amazon S3
Amazon S3 can send an event to a Lambda function when an object is created or deleted. You configure notification settings on a bucket, and grant Amazon S3 permission to invoke a function on the function's resource-based permissions policy.
When the object uploaded it will trigger the lambda function. Which creates the instance with ec2 user data Run commands on your Linux instance at launch.
For the ec2 instance make you provide the necessary permissions via Using instance profiles for download and uploading the objects.
user data has a script that does the rest of the work which you need for your workflow
Download the s3 object, you can pass the name and s3 bucket name in the same script
Once #1 finished, start the panorama.py which processes the videos.
In the next step you can start uploading the objects to the S3 bucket.
Eventually terminating the instance will be a bit tricky which you can achieve Change the instance initiated shutdown behavior
OR
you can use below method for terminating the instnace, but in that case your ec2 instance profile must have access to terminate the instance.
ec2-terminate-instances $(curl -s http://169.254.169.254/latest/meta-data/instance-id)
You can wrap the above steps into a shell script inside the userdata.
Lambda ec2 start instance:
def launch_instance(EC2, config, user_data):
ec2_response = EC2.run_instances(
ImageId=config['ami'], # ami-0123b531fc646552f
InstanceType=config['instance_type'],
KeyName=config['ssh_key_name'],
MinCount=1,
MaxCount=1,
SecurityGroupIds=config['security_group_ids'],
TagSpecifications=tag_specs,
# UserData=base64.b64encode(user_data).decode("ascii")
UserData=user_data
)
new_instance_resp = ec2_response['Instances'][0]
instance_id = new_instance_resp['InstanceId']
print(f"[DEBUG] Full ec2 instance response data for '{instance_id}': {new_instance_resp}")
return (instance_id, new_instance_resp)
Upload file to S3 -> Launch EC2 instance

Pipeline Transform Logging on Apache Beam in Dataproc

Recently, I deployed a very simple Apache Beam pipeline to get some insights into how it behaved executing in Dataproc as opposed to on my local machine. I quickly realized that after executing that any DoFn or transform-level logging didn't appear within the job logs within the Google Cloud Console as I would have expected and I'm not entirely sure what might be missing.
All of the high level logging messages are emitted as expected:
// This works
log.info("Testing logging operations...")
pipeline
.apply(Create.of(...))
.apply(ParDo.of(LoggingDoFn))
The LoggingDoFn class here is a very basic transform that emits each of the values that it encounters as seen below:
object LoggingDoFn : DoFn<String, ...>() {
private val log = LoggerFactory.getLogger(LoggingDoFn::class.java)
#ProcessElement
fun processElement(c: ProcessContext) {
// This is never emitted within the logs
log.info("Attempting to parse ${c.element()}")
}
}
As detailed in the comments, I can see logging messages outside of the processElement() calls (presumably because those are being executed by the Spark runner), but is there a way to easily expose those within the inner transform as well? When viewing the logs related to this job, we can see the higher-level logging present, but no mention of a "Attempting to parse ..." message from the DoFn:
The job itself is being executed by the following gcloud command, which has the driver log levels explicitly defined, but perhaps there's another level of logging or configuration that needs to be added:
gcloud dataproc jobs submit spark --jar=gs://some_bucket/deployment/example.jar --project example-project --cluster example-cluster --region us-example --driver-log-levels com.example=DEBUG -- --runner=SparkRunner --output=gs://some_bucket/deployment/out
To summarize, log messages are not being emitted to the Google Cloud Console for tasks that would generally be assigned to the Spark runner itself (e.g. processElement()). I'm unsure if it's a configuration-related issue or something else entirely.

Is it possible to update only part of a Glue Job using AWS CLI?

I am trying to include in my CI/CD development the update of the script_location and only this parameter. AWS is asking me to include the required parameters such as RoleArn. How can I only update the part of the job configuration I want to change ?
This is what I am trying to use
aws glue update-job --job-name <job_name> --job-update Command="{ScriptLocation=s3://<s3_path_to_script>}
This is what happens :
An error occurred (InvalidInputException) when calling the UpdateJob operation: Command name should not be null or empty.
If I add the default Command Name glueetl, this is what happens :
An error occurred (InvalidInputException) when calling the UpdateJob operation: Role should not be null or empty.
An easy way to update via CLI a glue-job or a glue-trigger is using --cli-input-json option. In order to use correct json you could use aws glue update-job --generate-cli-skeleton what returns a complete structure to insert your changes.
EX:
{"JobName":"","JobUpdate":{"Description":"","LogUri":"","Role":"","ExecutionProperty":{"MaxConcurrentRuns":0},"Command":{"Name":"","ScriptLocation":"","PythonVersion":""},"DefaultArguments":{"KeyName":""},"NonOverridableArguments":{"KeyName":""},"Connections":{"Connections":[""]},"MaxRetries":0,"AllocatedCapacity":0,"Timeout":0,"MaxCapacity":null,"WorkerType":"G.1X","NumberOfWorkers":0,"SecurityConfiguration":"","NotificationProperty":{"NotifyDelayAfter":0},"GlueVersion":""}}
Well here just fill the name of the job and change the options.
After this you have to transform your json into a one-line json and send into the command using ' '
aws glue update-job --cli-input-json '<one-line-json>'
I hope help someone with this problem too.
Ref:
https://docs.aws.amazon.com/cli/latest/reference/glue/update-job.html
https://w3percentagecalculator.com/json-to-one-line-converter/
I don't know whether you've solved this problem, but I managed using this command:
aws glue update-job --job-name <gluejobname> --job-update Role=myRoleNameBB,Command="{Name=<someupdatename>,ScriptLocation=<local_filename.py>}"
You don't need the the ARN of the role, rather the role name. The example above assumes that you have a role with the name myRoleNameBB and it has access to AWS Glue.
Note: I used a local file on my laptop. Also, the "Name" in "Command" part is also compulsory.
When I run it I go this output:
{
"JobName": "<gluejobname>"
}
Based on what I have found, there is no way to update just part of the job using the update-job API.
I ran into the same issue and I provided the role to get past this error. The command worked but the update-job API actually resets other parameters to defaults such as Type of application, Job Language,Class, Timeout, Max Capacity, etc.
So if your pre-existing job is a Spark Application in scala, it will fail as AWS defaults to Python Shell and python as job language as part of the update-job API. And this API provides no way to set job Language type to scala and set a main class (required in case of scala). It provides a way to set the application type to Spark application.
If you do not want to specify the Role to the update-job API. One approach is to copy the new script with the same name and same location that your pre-existing ETL job uses and then trigger your ETL using start-job API as part of the CI process.
Second approach is to run your ETL directly and force it to use the latest script in the start-job API call:
aws glue start-job-run --job-name <job-name> --arguments=scriptLocation="<path to your latest script>"
The only caveat with the second approach is when you look in the console the ETL job will still be referencing the old script Location. The above command just forces this run of the job to use the latest script which you can confirm by looking in the History tab on the Glue ETL console.

Log link of failed Hive job submitted to Dataproc through Airflow

I have submitted a Hive job using Airflow's DataprocWorkflowTemplateInstantiateInlineOperator to Dataproc cluster. When some of the jobs fail in googlecloud->dataproc->jobs I can see a link to the log with failure:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput'
Can I fetch this log link (e.g. gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput) through Airflow?
I checked gcp_dataproc_hook.py operator for anything that points to a log link so that I can retrieve it, but didn't find anything useful.
Looks like there's no auto-created handy link to fetch the output in Airflow's logs yet, but it could certainly be added (if you're feeling bold, could be worth sending a pull request to Airflow yourself! Or otherwise filing a feature request https://issues.apache.org/jira/browse/AIRFLOW).
In general you can construct a handy URL or a copy/pasteable CLI command given the jobid; if you want to use Dataproc's UI directly, simply construct a URL of the form:
https://cloud.google.com/console/dataproc/jobs/%s/?project=%s&region=%s
with params
jobId, projectId, region
Alternatively, you could type:
gcloud dataproc jobs wait ${JOBID} --project ${PROJECTID} --region ${REGION}
A more direct approach with the URI would be:
gsutil cat ${LOG_LINK}*
with a glob expression at the end of that URL (it's not just a single file, it's a set of files).

Resources