Log link of failed Hive job submitted to Dataproc through Airflow - python-3.x

I have submitted a Hive job to a Dataproc cluster using Airflow's DataprocWorkflowTemplateInstantiateInlineOperator. When some of the jobs fail, I can see a link to the failure log under Google Cloud -> Dataproc -> Jobs:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput'
Can I fetch this log link (e.g. gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput) through Airflow?
I checked gcp_dataproc_hook.py for anything that points to a log link so that I could retrieve it, but didn't find anything useful.

It looks like there's no auto-created handy link to fetch the output in Airflow's logs yet, but it could certainly be added (if you're feeling bold, it could be worth sending a pull request to Airflow yourself! Otherwise, you could file a feature request at https://issues.apache.org/jira/browse/AIRFLOW).
In general you can construct a handy URL or a copy/pasteable CLI command given the job ID. If you want to use Dataproc's UI directly, simply construct a URL of the form:
https://cloud.google.com/console/dataproc/jobs/%s/?project=%s&region=%s
filling in the jobId, projectId, and region parameters.
Alternatively, you could type:
gcloud dataproc jobs wait ${JOBID} --project ${PROJECTID} --region ${REGION}
A more direct approach with the URI would be:
gsutil cat ${LOG_LINK}*
with a glob expression at the end of that URI (the driver output is a set of files, not a single file).
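If you want to pull that driver output back into the Airflow task log itself, here is a minimal sketch of the same idea in Python (assuming a recent google-cloud-storage client library is available to the worker; fetch_driver_output is just an illustrative name):

from google.cloud import storage

def fetch_driver_output(log_link):
    # log_link is the gs:// URI reported by Dataproc, e.g.
    # gs://<bucket>/google-cloud-dataproc-metainfo/<cluster-id>/jobs/<job-id>/driveroutput
    bucket_name, _, prefix = log_link[len("gs://"):].partition("/")
    client = storage.Client()
    # The driver output is sharded across several objects under this prefix
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    return "".join(blob.download_as_text() for blob in blobs)

You could call this from an Airflow task (for example in an on_failure_callback) and log the result so it shows up alongside the task's own logs.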

Related

Get console job output text from dataproc using rest api

I need to retrieve the Dataproc job output text using the REST API. I am only able to find logs through Cloud Logging. Can someone let me know whether it is possible to retrieve the job output text through the REST API, and if so, how?
Dataproc job driver output is stored in either the staging bucket (default) or the bucket you specified when you created your cluster. You can first get the job resource, then read the URI from driverOutputResourceUri, and then use the GCS API to fetch the actual output. See this doc for more details.
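As a rough illustration of that flow (assuming the google-cloud-dataproc client library; get_driver_output_uri is just an illustrative name):

from google.cloud import dataproc_v1

def get_driver_output_uri(project_id, region, job_id):
    # Use the regional endpoint for the Dataproc API
    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    job = client.get_job(project_id=project_id, region=region, job_id=job_id)
    # Equivalent to the driverOutputResourceUri field in the REST response
    return job.driver_output_resource_uri

The returned gs:// URI is a prefix; the actual output is split across several objects under it, which you can read back with the GCS client (as in the sketch further up this page) or with gsutil cat.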
If you use gcloud you can get driverOutputResourceUri with describe:
$ gcloud dataproc jobs describe spark-pi
...
driverOutputResourceUri: gs://dataproc-nnn/jobs/spark-pi/driveroutput
If you just want to view the output, you can use wait:
$ gcloud dataproc jobs wait 5c1754a5-34f7-4553-b667-8a1199cb9cab \
--project my-project-id --region my-cluster-region
Waiting for job output...
INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.2-hadoop2
16:47:45 INFO client.RMProxy: Connecting to ResourceManager at my-test-cluster-m/
...

How to read stderr logs from AWS logs

I am using EMR steps to run my jobs.
Typically when I want to analyze the performance of a job or understand why it failed, I look at the Spark history server for DAG visualizations, job errors, etc.
For example, if the job failed due to a heap error, a FetchFailed error, etc., I can see it clearly specified in the Spark history server.
However, I can't seem to find such descriptions when I look at the stderr log files that are written to the log URI S3 bucket.
Is there a way to obtain such information?
I use PySpark and set the log level with:
sc = spark.sparkContext
sc.setLogLevel('DEBUG')
Any insight as to what I am doing wrong?
I haven't really tested this, but as it's a bit too long to fit in a comment, I'm posting it here as an answer.
As pointed out in my comment, the logs you're viewing in the Spark History Server UI aren't the same as the Spark driver logs that are saved to S3 from EMR.
To get the Spark history server logs written into S3, you'll have to add some additional configuration to your cluster. These configuration options are described in the Monitoring and Instrumentation section of the Spark documentation.
In AWS EMR, you could try to add something like this into your cluster configuration:
...
{
    'Classification': 'spark-defaults',
    'Properties': {
        'spark.eventLog.dir': 's3a://your_bucket/spark_logs',
        'spark.history.fs.logDirectory': 's3a://your_bucket/spark_logs',
        'spark.eventLog.enabled': 'true'
    }
}
...
I found this interesting post which describes how to set this up for a Kubernetes cluster; you may want to check it for further details.
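I haven't verified this end to end either, but for completeness, here is roughly where the spark-defaults classification above would go if you create the cluster with boto3 (the cluster name, bucket, release label, and instance settings below are only illustrative):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-cluster",
    ReleaseLabel="emr-6.2.0",
    Applications=[{"Name": "Spark"}],
    # Same spark-defaults classification as shown above
    Configurations=[
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.eventLog.enabled": "true",
                "spark.eventLog.dir": "s3a://your_bucket/spark_logs",
                "spark.history.fs.logDirectory": "s3a://your_bucket/spark_logs",
            },
        }
    ],
    Instances={
        "InstanceGroups": [
            {
                "Name": "primary",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)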

Pipeline Transform Logging on Apache Beam in Dataproc

Recently, I deployed a very simple Apache Beam pipeline to get some insight into how it behaves executing in Dataproc as opposed to on my local machine. I quickly realized after executing it that any DoFn or transform-level logging didn't appear within the job logs in the Google Cloud Console as I would have expected, and I'm not entirely sure what might be missing.
All of the high level logging messages are emitted as expected:
// This works
log.info("Testing logging operations...")

pipeline
    .apply(Create.of(...))
    .apply(ParDo.of(LoggingDoFn))
The LoggingDoFn class here is a very basic transform that emits each of the values that it encounters as seen below:
object LoggingDoFn : DoFn<String, ...>() {
    private val log = LoggerFactory.getLogger(LoggingDoFn::class.java)

    @ProcessElement
    fun processElement(c: ProcessContext) {
        // This is never emitted within the logs
        log.info("Attempting to parse ${c.element()}")
    }
}
As detailed in the comments, I can see logging messages outside of the processElement() calls (presumably because those are executed by the Spark runner), but is there a way to easily expose the ones inside the transform as well? When viewing the logs related to this job, I can see the higher-level logging present, but no mention of an "Attempting to parse ..." message from the DoFn.
The job itself is being executed by the following gcloud command, which has the driver log levels explicitly defined, but perhaps there's another level of logging or configuration that needs to be added:
gcloud dataproc jobs submit spark --jar=gs://some_bucket/deployment/example.jar --project example-project --cluster example-cluster --region us-example --driver-log-levels com.example=DEBUG -- --runner=SparkRunner --output=gs://some_bucket/deployment/out
To summarize, log messages are not being emitted to the Google Cloud Console for tasks that would generally be assigned to the Spark runner itself (e.g. processElement()). I'm unsure if it's a configuration-related issue or something else entirely.

Is it possible to update only part of a Glue Job using AWS CLI?

I am trying to include the update of script_location, and only this parameter, in my CI/CD pipeline. AWS is asking me to include required parameters such as RoleArn. How can I update only the part of the job configuration I want to change?
This is what I am trying to use
aws glue update-job --job-name <job_name> --job-update Command="{ScriptLocation=s3://<s3_path_to_script>}"
This is what happens:
An error occurred (InvalidInputException) when calling the UpdateJob operation: Command name should not be null or empty.
If I add the default command name glueetl, this is what happens:
An error occurred (InvalidInputException) when calling the UpdateJob operation: Role should not be null or empty.
An easy way to update a Glue job or a Glue trigger via the CLI is to use the --cli-input-json option. To get the correct JSON structure, you can run aws glue update-job --generate-cli-skeleton, which returns a complete skeleton into which you can insert your changes.
EX:
{
    "JobName": "",
    "JobUpdate": {
        "Description": "",
        "LogUri": "",
        "Role": "",
        "ExecutionProperty": {"MaxConcurrentRuns": 0},
        "Command": {"Name": "", "ScriptLocation": "", "PythonVersion": ""},
        "DefaultArguments": {"KeyName": ""},
        "NonOverridableArguments": {"KeyName": ""},
        "Connections": {"Connections": [""]},
        "MaxRetries": 0,
        "AllocatedCapacity": 0,
        "Timeout": 0,
        "MaxCapacity": null,
        "WorkerType": "G.1X",
        "NumberOfWorkers": 0,
        "SecurityConfiguration": "",
        "NotificationProperty": {"NotifyDelayAfter": 0},
        "GlueVersion": ""
    }
}
Here, just fill in the name of the job and change the options you need.
After this, transform your JSON into a one-line JSON and pass it to the command wrapped in single quotes:
aws glue update-job --cli-input-json '<one-line-json>'
I hope this helps someone with the same problem.
Ref:
https://docs.aws.amazon.com/cli/latest/reference/glue/update-job.html
https://w3percentagecalculator.com/json-to-one-line-converter/
I don't know whether you've solved this problem, but I managed to do it using this command:
aws glue update-job --job-name <gluejobname> --job-update Role=myRoleNameBB,Command="{Name=<someupdatename>,ScriptLocation=<local_filename.py>}"
You don't need the ARN of the role, just the role name. The example above assumes that you have a role named myRoleNameBB and that it has access to AWS Glue.
Note: I used a local file on my laptop. Also, the "Name" in the "Command" part is compulsory.
When I ran it I got this output:
{
    "JobName": "<gluejobname>"
}
Based on what I have found, there is no way to update just part of the job using the update-job API.
I ran into the same issue and provided the role to get past this error. The command worked, but the update-job API actually resets other parameters to their defaults, such as the type of application, job language, class, timeout, max capacity, etc.
So if your pre-existing job is a Spark application in Scala, it will fail, because AWS defaults to Python shell and Python as the job language as part of the update-job API. This API provides no way to set the job language to Scala or to set a main class (required for Scala jobs); it only provides a way to set the application type to a Spark application.
If you do not want to specify the role to the update-job API, one approach is to copy the new script, with the same name, to the same location that your pre-existing ETL job uses, and then trigger your ETL using the start-job-run API as part of the CI process.
A second approach is to run your ETL directly and force it to use the latest script in the start-job-run API call:
aws glue start-job-run --job-name <job-name> --arguments=scriptLocation="<path to your latest script>"
The only caveat with the second approach is that when you look in the console, the ETL job will still reference the old script location. The command above just forces this particular run of the job to use the latest script, which you can confirm by looking in the History tab of the Glue ETL console.
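Not mentioned in the answers above, but if your CI/CD step can run Python, another way to change only the script location is a read-modify-write with boto3: fetch the current definition, patch ScriptLocation, and send the whole definition back (UpdateJob always replaces the full job configuration). A rough, untested sketch:

import boto3

glue = boto3.client("glue")

def update_script_location(job_name, new_script_location):
    job = glue.get_job(JobName=job_name)["Job"]
    # Drop fields returned by get_job that UpdateJob's JobUpdate does not accept;
    # depending on your job you may need to drop more (e.g. MaxCapacity cannot be
    # combined with WorkerType/NumberOfWorkers).
    for key in ("Name", "CreatedOn", "LastModifiedOn", "AllocatedCapacity"):
        job.pop(key, None)
    job["Command"]["ScriptLocation"] = new_script_location
    return glue.update_job(JobName=job_name, JobUpdate=job)

Because the rest of the definition is copied from the existing job, the other parameters (job language, class, timeout, etc.) keep their current values instead of being reset to defaults.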

How to pull Spark jobs client logs submitted using Apache Livy batches POST method using AirFlow

I am working on submitting a Spark job using the Apache Livy batches POST method.
This HTTP request is sent using Airflow. After submitting the job, I track its status using the batch ID.
I want to show the driver (client) logs in the Airflow logs to avoid going to multiple places (Airflow and Apache Livy/Resource Manager).
Is this possible to do using the Apache Livy REST API?
Livy has endpoints to get logs: /sessions/{sessionId}/log and /batches/{batchId}/log.
Documentation:
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-sessionssessionidlog
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-batchesbatchidlog
You can create Python functions like the ones shown below to get the logs:

import json

# For newer Airflow versions the import is:
# from airflow.providers.http.hooks.http import HttpHook
from airflow.hooks.http_hook import HttpHook


# Example wrapper class; the original snippet used self without a class
class LivyBatchLogFetcher:
    def __init__(self, http_conn_id):
        # The connection should point at the Livy server, e.g. http://livy-server:8998
        self.http = HttpHook("GET", http_conn_id=http_conn_id)

    def _http_rest_call(self, method, endpoint, data=None, headers=None, extra_options=None):
        if not extra_options:
            extra_options = {}
        self.http.method = method
        response = self.http.run(endpoint, json.dumps(data), headers, extra_options=extra_options)
        return response

    def _get_batch_session_logs(self, batch_id):
        # GET /batches/{batchId}/log returns the driver log lines for the batch
        endpoint = "batches/" + str(batch_id) + "/log"
        response = self._http_rest_call(method="GET", endpoint=endpoint)
        # use response.json() to get the parsed payload
        return response
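To actually surface those lines in the Airflow task log (which is what the question asks for), you could call this from a PythonOperator callable and write the returned lines through the standard logging module. LivyBatchLogFetcher is the illustrative wrapper from the snippet above, and "log" is the list of log lines Livy returns from its /log endpoints:

import logging

def print_livy_batch_logs(batch_id, http_conn_id="livy_http", **context):
    fetcher = LivyBatchLogFetcher(http_conn_id=http_conn_id)
    payload = fetcher._get_batch_session_logs(batch_id).json()
    for line in payload.get("log", []):
        # Anything logged here ends up in the Airflow task log for this task
        logging.info(line)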
Livy exposes its REST API in two ways: sessions and batches. In your case, since you are not using sessions, you are submitting using batches. You can post your batch using the curl command:
curl http://livy-server-IP:8998/batches
Once you have submitted the job, you would get the batch ID in return. Then you can curl using the command:
curl http://livy-server-IP:8998/batches/{batchId}/log
You can find the documentation at:
https://livy.incubator.apache.org/docs/latest/rest-api.html
If you want to avoid the above steps, you can use a ready-made AMI (namely, LightningFlow) from the AWS Marketplace, which provides Airflow with a custom Livy operator. The Livy operator submits the job and tracks its status every 30 seconds (configurable), and it also surfaces the Spark logs at the end of the Spark job in the Airflow UI logs.
Note: LightningFlow comes pre-integrated with all required libraries, Livy, custom operators, and a local Spark cluster.
Link for AWS Marketplace:
https://aws.amazon.com/marketplace/pp/Lightning-Analytics-Inc-LightningFlow-Integrated-o/B084BSD66V
This will enable you to view consolidated logs in one place, instead of shuffling between Airflow and EMR/Spark logs (Ambari/Resource Manager).
