View AWS Glue job logs with CLI - aws-cli

The AWS CLI is very good for managing AWS Glue jobs. But if a job fails, I may not see anything more useful than something like:
"JobRunState": "FAILED",
"ErrorMessage": "User application exited with status 10",
And I have to go through the mountain of CloudWatch logs hoping to find something useful. Would appreciate any ideas on getting all logs through the CLI so I can use things like grep.

Found this question while looking for the answer myself. The following command gets the logs for the last job,
JOB_ID=$(aws glue get-job-runs --job-name $JOB_NAME --query 'JobRuns[0].Id' --output text)
aws logs get-log-events --log-group-name /aws-glue/jobs/output --log-stream-name $JOB_ID
Where $JOB_NAME is the name of your Glue job. You can also use the log group name /aws-glue/jobs/error to see messages written to stderr though I've found /output more useful.

Related

EC2 instance running S3 Sync command terminates before data transfer is complete

I have an EC2 instance running Linux. This instance is used to run aws s3 commands.
I want to sync the last 6 months worth of data from source to target S3 buckets. I am using credentials with the necessary permissions to do this.
Initially I just ran the command:
aws s3 sync "s3://source" "s3://target" --query "Contents[?LastModified>='2022-08-11' && LastModified<='2023-01-11']"
However, after maybe 10 mins this command stops running, and only a fraction of the data is synced.
I thought this was because my SSM session was terminating, and with it the command stopped executing.
To combat this, I used the following command to try and ensure that this command would continue to execute even after my SSM terminal session was closed:
nohup aws s3 sync "s3://source" "s3://target" --query "Contents[?LastModified>='2022-08-11' && LastModified<='2023-01-11']" --exclude "*.log" --exclude "*.bak" &
Checking the status of the EC2 instance, the command appears to run for about 20 mins, before clearly stopping for some reason.
The --query parameter controls what information is displayed in the response from an API call.
It does not control which files are copied in an aws s3 sync command. The documentation for aws s3 sync defines the --query parameter as: "A JMESPath query to use in filtering the response data."
Your aws s3 sync command will be synchronizing ALL files unless you use Exclude and Include Filters. These filters operate on the name of the object. It is not possible to limit the sync command by supplying date ranges.
I cannot comment on why the command would stop running before it is complete. I suggest you redirect output to a log file and then review the log file for any clues.

How to use "AWS logs tail" - syntax question

I'm new to AWS CLI and having trouble getting the tail command to work. I'm using version 2. I try the syntax
aws logs tail group_name /aws/lambda/schedule-jobs --since 1h
and it gives me the error
unknown options: /aws/lambda/schedule-jobs
If I try it without the group_name it tells me the group_name is a required parameter (of course).
If I try it with another log group name it tells me the same error message (with the different log group name).
If I put quotes around the name it gives the same error message.
I know this must be a very simple scenario but I can't find any working examples on Google search or in other SO entries. What am I doing wrong?
Yes, when I call
aws logs describe-log-groups
my log group is there and spelled correctly. I'm in the right account and region.
Yes, I am an admin on the account and have full access to the logs.
Just write aws logs tail /aws/lambda/schedule-jobs --since 1h
aws logs tail command doesn't work on aws cli v2.
You can use the following command to access the logs using v2 API.
aws logs get-log-events --log-group-name <log-group-name> --log-stream-name <log-stream-name> --limit=1000

How to read stderr logs from AWS logs

I am using EMR steps to run my jobs.
Typically when I want to analyze the performance of a job or to understand why it failed, I look at the spark history server for DAG visualizations, and job errors, etc.
For example, if the job failed due to heap error, or Fetchfailed, etc, I can see it clearly specified in the spark history server.
However, I can't seem to be able to find such descriptions when I look at the stderr log files that are written to the LOG URI S3 bucket.
Is there a way to obtain such information?
I use pyspark and set the log level to
sc = spark.sparkContext
sc.setLogLevel('DEBUG')
Any insight as to what I am doing wrong?
I haven't really tested this but as it's a bit long to fit in a comment, I post it here as an answer.
Like pointed out in my comment, the logs you're viewing using Spark History Server UI aren't the same as the Spark driver logs that are saved to S3 from EMR.
To get the spark history server logs written into S3, you'll have to add some additional configuration to your cluster. These configuration options are described in the section Monitoring and Instrumentation of Spark documentation.
In AWS EMR, you could try to add something like this into your cluster configuration:
...
{
'Classification': 'spark-defaults',
'Properties': {
'spark.eventLog.dir': 's3a://your_bucket/spark_logs',
'spark.history.fs.logDirectory': 's3a://your_bucket/spark_logs',
'spark.eventLog.enabled': 'true'
}
}
...
I found this interesting post which describes how to set this for Kubernetes cluster, you may want to check it for further details.

Log link of failed Hive job submitted to Dataproc through Airflow

I have submitted a Hive job using Airflow's DataprocWorkflowTemplateInstantiateInlineOperator to Dataproc cluster. When some of the jobs fail in googlecloud->dataproc->jobs I can see a link to the log with failure:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput'
Can I fetch this log link (e.g. gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput) through Airflow?
I checked gcp_dataproc_hook.py operator for anything that points to a log link so that I can retrieve it, but didn't find anything useful.
Looks like there's no auto-created handy link to fetch the output in Airflow's logs yet, but it could certainly be added (if you're feeling bold, could be worth sending a pull request to Airflow yourself! Or otherwise filing a feature request https://issues.apache.org/jira/browse/AIRFLOW).
In general you can construct a handy URL or a copy/pasteable CLI command given the jobid; if you want to use Dataproc's UI directly, simply construct a URL of the form:
https://cloud.google.com/console/dataproc/jobs/%s/?project=%s&region=%s
with params
jobId, projectId, region
Alternatively, you could type:
gcloud dataproc jobs wait ${JOBID} --project ${PROJECTID} --region ${REGION}
A more direct approach with the URI would be:
gsutil cat ${LOG_LINK}*
with a glob expression at the end of that URL (it's not just a single file, it's a set of files).

Errors trying to resize instance group with aws cli

I have a standing EMR cluster and a daily job I want to run. I was trying to use the aws cli to resize the cluster, with the plan of adding this to the crontab so the cluster would grow and then shrink later. (I don't have ability for auto-scaling, so that's out)
I have read the Amazon documentation and the examples they give don't work. I've tried the natural variations, but end up getting nowhere.
According to the documentation, the command is
aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-31JXXXXXXBTO,InstanceCount=4
However, when I try this with my own instance ID, i get:
Error parsing parameter '--instance-groups': Expected: '<second>', received: '<none>' for input:InstanceGroupId=ig-31JXXXXXXBTO,
I've tried doing things like removing the instance count, hoping for more documentation...
aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-WCXEP0AXCGJS
which gives the response
An error occurred (ValidationException) when calling the ModifyInstanceGroups operation: Please provide either an instance count or a list of EC2 instance ids to terminate.
I've tried several variations without luck. Any ideas? Thanks.
I ended up submitting a trouble ticket through Amazon.
The resize command requires that no space occur after the comma. The trouble shooter has reported this behavior and unhelpful error to the developers.
aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-31JXXXXXXBTO,InstanceCount=4
will work, as long as there is no space after the comma. hopefully they'll either fix that or provide a better error message.

Resources