In Google Cloud Dataproc, where are all the logs stored? - apache-spark

I have a PySpark job that I am distributing across a 1-master, 3-worker cluster.
I have some Python print statements that help me debug my code.
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')
print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
Now, when I run the code on Google Dataproc with the master set to local, the print output appears correctly. However, when I run it with YARN-based Spark, the print output does not appear in the Google Cloud Console under the Jobs section of the Dataproc UI.
Where can I access this Python print output from each of the workers and the master, given that it does not appear in the Google Dataproc console?

If you're using Dataproc, why access the logs via the Spark UI? The better way would be to:
Submit a job using gcloud dataproc jobs submit (see the example in the reference documentation linked below).
Once the job is submitted, you can access Cloud Dataproc job driver output using the Cloud Platform Console, the gcloud command, or Cloud Storage, as explained below.
The Cloud Platform Console allows you to view a job's real-time driver output. To view it, go to your project's Cloud Dataproc Jobs section, then click on the Job ID.
Reference Documentation
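For completeness, here is a minimal, untested Python sketch of the same submit-and-view flow using the google-cloud-dataproc client library; the project, region, cluster, and script URI below are placeholders. Driver-side print statements (like the ones in the question) end up in the driver output shown under that job ID.

from google.cloud import dataproc_v1

project_id, region, cluster = "my-project", "us-central1", "example-cluster"  # placeholders

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/my_job.py"},  # placeholder script
}

submitted = job_client.submit_job(
    request={"project_id": project_id, "region": region, "job": job}
)
# Look this ID up under Dataproc > Jobs in the Console to read the driver output.
print(submitted.reference.job_id)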

If you really want access to the YARN interface (with the detailed list of all the jobs and their logs), you can do the following:
Get the external IP address of your master. You can find it under Cluster Details / VM Instances in the UI.
Just click on your master.
Connect to the URL: http://yourMastersExternalIpAddress:8088/cluster
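If you prefer to script this instead of clicking through the UI, the ResourceManager's REST API on the same port backs that /cluster page. A rough sketch (the address is a placeholder, and port 8088 must be reachable through your firewall rules or an SSH tunnel):

import requests

master_ip = "yourMastersExternalIpAddress"  # placeholder from the step above

# GET /ws/v1/cluster/apps returns the same application list the /cluster UI shows.
resp = requests.get(f"http://{master_ip}:8088/ws/v1/cluster/apps").json()
for app in (resp.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["state"], app["trackingUrl"])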

Related

Get console job output text from Dataproc using the REST API

I need to retrieve the Dataproc job output text using the REST API. I am only able to find logs through Cloud Logging. Can someone let me know whether it is possible to retrieve the job output text through the REST API? If yes, how?
Dataproc job driver output is stored in either the staging bucket (default) or the bucket you specified when you created your cluster. You can first get the job resource, then get the URI through driverOutputResourceUri, then use GCS API to get the actual output. See this doc for more details.
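As an untested illustration of that flow using the google-cloud-dataproc and google-cloud-storage client libraries (project, region, and job ID are placeholders; the driver output is stored as one or more chunk files under the driveroutput prefix):

import re
from google.cloud import dataproc_v1, storage

project_id, region, job_id = "my-project", "us-central1", "spark-pi"  # placeholders

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)
job = job_client.get_job(
    request={"project_id": project_id, "region": region, "job_id": job_id}
)

# driverOutputResourceUri looks like gs://<bucket>/<path>/driveroutput
bucket_name, prefix = re.match(r"gs://([^/]+)/(.+)", job.driver_output_resource_uri).groups()

gcs = storage.Client()
for blob in gcs.list_blobs(bucket_name, prefix=prefix):
    print(blob.download_as_text(), end="")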
If you use gcloud you can get driverOutputResourceUri with describe:
$ gcloud dataproc jobs describe spark-pi
...
driverOutputResourceUri: gs://dataproc-nnn/jobs/spark-pi/driveroutput
If you just want to view the output, you can use wait:
$ gcloud dataproc jobs wait 5c1754a5-34f7-4553-b667-8a1199cb9cab \
--project my-project-id --region my-cluster-region
Waiting for job output...
INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.2-hadoop2 16:47:45 INFO client.RMProxy: Connecting to ResourceManager at my-test-cluster-m/
...

How to read stderr logs from AWS logs

I am using EMR steps to run my jobs.
Typically, when I want to analyze the performance of a job or understand why it failed, I look at the Spark history server for DAG visualizations, job errors, etc.
For example, if the job failed due to a heap error, FetchFailed, etc., I can see it clearly specified in the Spark history server.
However, I can't seem to find such descriptions when I look at the stderr log files that are written to the log URI S3 bucket.
Is there a way to obtain such information?
I use pyspark and set the log level to
sc = spark.sparkContext
sc.setLogLevel('DEBUG')
Any insight as to what I am doing wrong?
I haven't really tested this, but since it's a bit too long to fit in a comment, I'm posting it here as an answer.
As pointed out in my comment, the logs you're viewing with the Spark History Server UI aren't the same as the Spark driver logs that are saved to S3 from EMR.
To get the Spark history server event logs written to S3, you'll have to add some additional configuration to your cluster. These configuration options are described in the Monitoring and Instrumentation section of the Spark documentation.
In AWS EMR, you could try adding something like this to your cluster configuration:
...
{
    'Classification': 'spark-defaults',
    'Properties': {
        'spark.eventLog.dir': 's3a://your_bucket/spark_logs',
        'spark.history.fs.logDirectory': 's3a://your_bucket/spark_logs',
        'spark.eventLog.enabled': 'true'
    }
}
...
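I haven't run this either, but as a sketch of where that block goes when creating the cluster with boto3 (the region, instance types, release label, bucket, and roles are placeholders):

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

emr.run_job_flow(
    Name="spark-history-to-s3",
    ReleaseLabel="emr-6.10.0",  # placeholder release label
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=[
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.eventLog.enabled": "true",
                "spark.eventLog.dir": "s3a://your_bucket/spark_logs",
                "spark.history.fs.logDirectory": "s3a://your_bucket/spark_logs",
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)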
I found this interesting post which describes how to set this up for a Kubernetes cluster; you may want to check it for further details.

Pipeline Transform Logging on Apache Beam in Dataproc

Recently, I deployed a very simple Apache Beam pipeline to get some insight into how it behaves executing on Dataproc as opposed to on my local machine. I quickly realized after executing it that no DoFn or transform-level logging appeared within the job logs in the Google Cloud Console as I would have expected, and I'm not entirely sure what might be missing.
All of the high level logging messages are emitted as expected:
// This works
log.info("Testing logging operations...")

pipeline
    .apply(Create.of(...))
    .apply(ParDo.of(LoggingDoFn))
The LoggingDoFn class here is a very basic transform that emits each of the values that it encounters as seen below:
object LoggingDoFn : DoFn<String, ...>() {
    private val log = LoggerFactory.getLogger(LoggingDoFn::class.java)

    @ProcessElement
    fun processElement(c: ProcessContext) {
        // This is never emitted within the logs
        log.info("Attempting to parse ${c.element()}")
    }
}
As detailed in the comments, I can see logging messages outside of the processElement() calls (presumably because those are being executed by the Spark runner), but is there a way to easily expose those from within the inner transform as well? When viewing the logs related to this job, we can see the higher-level logging present, but no mention of an "Attempting to parse ..." message from the DoFn.
The job itself is being executed by the following gcloud command, which has the driver log levels explicitly defined, but perhaps there's another level of logging or configuration that needs to be added:
gcloud dataproc jobs submit spark --jar=gs://some_bucket/deployment/example.jar --project example-project --cluster example-cluster --region us-example --driver-log-levels com.example=DEBUG -- --runner=SparkRunner --output=gs://some_bucket/deployment/out
To summarize, log messages are not being emitted to the Google Cloud Console for tasks that would generally be assigned to the Spark runner itself (e.g. processElement()). I'm unsure if it's a configuration-related issue or something else entirely.

Log link of failed Hive job submitted to Dataproc through Airflow

I have submitted a Hive job using Airflow's DataprocWorkflowTemplateInstantiateInlineOperator to a Dataproc cluster. When some of the jobs fail, under Google Cloud -> Dataproc -> Jobs I can see a link to the log with the failure:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput'
Can I fetch this log link (e.g. gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput) through Airflow?
I checked gcp_dataproc_hook.py operator for anything that points to a log link so that I can retrieve it, but didn't find anything useful.
Looks like there's no auto-created handy link to fetch the output in Airflow's logs yet, but it could certainly be added (if you're feeling bold, it could be worth sending a pull request to Airflow yourself! Or otherwise filing a feature request at https://issues.apache.org/jira/browse/AIRFLOW).
In general you can construct a handy URL or a copy/pasteable CLI command given the jobid; if you want to use Dataproc's UI directly, simply construct a URL of the form:
https://cloud.google.com/console/dataproc/jobs/%s/?project=%s&region=%s
with the jobId, projectId, and region parameters filled in.
Alternatively, you could type:
gcloud dataproc jobs wait ${JOBID} --project ${PROJECTID} --region ${REGION}
A more direct approach with the URI would be:
gsutil cat ${LOG_LINK}*
with a glob expression at the end of that URI (it's not just a single file, it's a set of files).
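A rough Python equivalent of that gsutil call, using google-cloud-storage to concatenate the driveroutput chunk files (the gs:// link itself is whatever your job reports; the value below is a placeholder):

from google.cloud import storage

log_link = "gs://<staging-bucket>/google-cloud-dataproc-metainfo/<cluster-uuid>/jobs/<job-id>/driveroutput"  # placeholder

bucket_name, prefix = log_link[len("gs://"):].split("/", 1)
client = storage.Client()
print("".join(b.download_as_text() for b in client.list_blobs(bucket_name, prefix=prefix)))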

How to pull Spark job client logs submitted using the Apache Livy batches POST method using Airflow

I am working on submitting Spark jobs using the Apache Livy batches POST method.
This HTTP request is sent using Airflow. After submitting the job, I am tracking its status using the batch ID.
I want to show the driver (client) logs in the Airflow logs to avoid going back and forth between multiple places (Airflow and Apache Livy/Resource Manager).
Is this possible to do using Apache Livy REST API?
Livy has endpoints to get logs: /sessions/{sessionId}/log and /batches/{batchId}/log.
Documentation:
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-sessionssessionidlog
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-batchesbatchidlog
You can create Python functions like the ones shown below to get the logs:
import json

from airflow.hooks.http_hook import HttpHook

# Illustrative wrapper; in practice these could just as well be methods on a custom operator.
class LivyBatchLogFetcher:
    def __init__(self, http_conn_id):
        # Connection pointing at the Livy server, e.g. http://livy-server:8998
        self.http = HttpHook("GET", http_conn_id=http_conn_id)

    def _http_rest_call(self, method, endpoint, data=None, headers=None, extra_options=None):
        if not extra_options:
            extra_options = {}
        self.http.method = method
        payload = json.dumps(data) if data is not None else None
        return self.http.run(endpoint, payload, headers, extra_options=extra_options)

    def _get_batch_session_logs(self, batch_id):
        # GET /batches/{batchId}/log returns the driver log lines for the batch
        return self._http_rest_call(method="GET", endpoint="batches/" + str(batch_id) + "/log")
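As a usage sketch (the wrapper class and connection ID above are illustrative), the returned log lines can be printed from a PythonOperator callable so they end up in the Airflow task log:

def surface_livy_logs(batch_id, **_):
    fetcher = LivyBatchLogFetcher(http_conn_id="livy_http")  # illustrative connection ID
    body = fetcher._get_batch_session_logs(batch_id).json()
    for line in body.get("log", []):
        print(line)  # printed lines show up in the Airflow task log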
Livy exposes its REST API in two ways: sessions and batches. In your case, since we assume you are not using sessions, you are submitting using batches. You can post your batch with curl (note the POST method, JSON body, and Content-Type header; the application path is a placeholder):
curl -X POST -H "Content-Type: application/json" -d '{"file": "<path-to-your-application>"}' http://livy-server-IP:8998/batches
Once you have submitted the job, you would get the batch ID in return. Then you can curl using the command:
curl http://livy-server-IP:8998/batches/{batchId}/log
You can find the documentation at:
https://livy.incubator.apache.org/docs/latest/rest-api.html
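If you'd rather script those two calls than use curl, here is an untested requests-based sketch (server address and application path are placeholders; per the Livy docs, "file" is the only required field of a batch request):

import requests

livy = "http://livy-server-IP:8998"  # placeholder

# Submit the batch.
batch = requests.post(
    f"{livy}/batches",
    json={"file": "<path-to-your-application>"},  # placeholder application
    headers={"Content-Type": "application/json"},
).json()

# Fetch the driver log lines for that batch ID.
log = requests.get(f"{livy}/batches/{batch['id']}/log").json()
print("\n".join(log["log"]))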
If you want to avoid the above steps, you can use a ready-made AMI (namely, LightningFlow) from the AWS Marketplace which provides Airflow with a custom Livy operator. The Livy operator submits the job and tracks its status every 30 seconds (configurable), and it also surfaces the Spark logs at the end of the Spark job in the Airflow UI logs.
Note: LightningFlow comes pre-integrated with all required libraries, Livy, custom operators, and local Spark cluster.
Link for AWS Marketplace:
https://aws.amazon.com/marketplace/pp/Lightning-Analytics-Inc-LightningFlow-Integrated-o/B084BSD66V
This will enable you to view consolidated logs at one place, instead of shuffling between Airflow and EMR/Spark logs (Ambari/Resource Manager).
