How do I handle errors in mapped functions in AWS Glue? - apache-spark

I'm using the map method of DynamicFrame (or, equivalently, the Map.apply method). I've noticed that any errors in the function that I pass to these methods are silently ignored and cause the returned DynamicFrame to be empty.
Say I have a job script like this:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *

glueContext = GlueContext(SparkContext.getOrCreate())
dyF = glueContext.create_dynamic_frame.from_catalog(database="radixdemo", table_name="census_csv")

def my_mapper(rec):
    import logging
    logging.error("[RADIX] An error-log from in the mapper!")
    print "[RADIX] from in the mapper!"
    raise Exception("[RADIX] A bug!")

dyF = dyF.map(my_mapper, 'my_mapper')

print "Count: ", dyF.count()
dyF.printSchema()
dyF.toDF().show()
If I run this script in my Glue Dev Endpoint with gluepython, I get output like this:
[glue#ip-172-31-83-196 ~]$ gluepython gluejob.py
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/share/aws/glue/etl/jars/glue-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/05/23 20:56:46 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
Count: 0
root
++
||
++
++
Notes about this output:
I don't see the result of the print statement or the logging.error call.
There's no indication that my_mapper raised an exception.
The printSchema call shows that there is no schema metadata on the produced DynamicFrame.
The show method also isn't producing any output, indicating that all the rows are gone.
Likewise, when I save this script as a job in the AWS Glue console and run it, the job doesn't indicate that any error occurred -- the Job Status is "Succeeded". Notably, I do see the output of the print statements and logging.error calls in the job logs, but only in the regular "Logs", not the "Error Logs".
What I want is to be able to indicate that my job has failed, and to be able to easily find these error logs. Most important is to just indicate that it has failed.
Is there a way to log an error within a mapped function in such a way that Glue will pick it up as an "Error Log" (and put it in that separate AWS CloudWatch Logs path)? If this happens, will it automatically mark the entire Job as Failing? Or is there some other way to explicitly fail the job from within a mapped function?
(my plan, if there is a way to log errors and/or mark the job as failed, is to create a decorator or other utility function that will automatically catch exceptions in my mapped functions and ensure that they are logged & marked as a failure).

The only way I have discovered to make a Glue job show up as "Failed" is to raise an exception from the main script (not inside a mapper or filter function, as those seem to get spun out to the Data Processing Units).
Fortunately, there is a way to detect if an exception occurred inside of a map or filter function: using the DynamicFrame.stageErrorsCount() method. It will return a number indicating how many exceptions were raised while running the most recent transformation.
So the correct way to solve all the problems:
Make sure your map or transform function explicitly logs any exceptions that occur inside of it. This is best done with a decorator or some other reusable mechanism, rather than relying on putting try/except statements in every single function you write.
After every transformation that you want to catch errors in, call the stageErrorsCount() method and check whether it's greater than 0. If you want to abort the job, just raise an exception.
For example:
import logging

def log_errors(inner):
    def wrapper(*args, **kwargs):
        try:
            return inner(*args, **kwargs)
        except Exception:
            logging.exception('Error in function: {}'.format(inner))
            raise
    return wrapper

@log_errors
def foo(record):
    1 / 0
Then, inside your job, you'd do something like:
df = df.map(foo, "foo")
if df.stageErrorsCount() > 0:
    raise Exception("Error in job! See the log!")
Note that even calling logging.exception from inside the mapper function still doesn't write the logs to the error log in AWS CloudWatch Logs, for some reason. It gets written to the regular success logs. However, with this technique you will at least see that the job failed and be able to find the info in the logs. Another caveat: Dev Endpoints don't seem to show ANY logs from the mapper or filter functions.
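To avoid repeating the stageErrorsCount() check after every transformation, you could fold it into a small helper like the sketch below. The helper name is made up; the important part is that the exception is raised from the main script (the driver), since that is what actually marks the job as Failed:
import logging

def fail_on_stage_errors(dynamic_frame, stage_name):
    """Hypothetical helper: raise from the driver if the most recent
    transformation on this DynamicFrame recorded any errors."""
    error_count = dynamic_frame.stageErrorsCount()
    if error_count > 0:
        logging.error("%s errors in stage '%s'; see the job logs", error_count, stage_name)
        raise Exception("Stage '{}' had {} errors".format(stage_name, error_count))
    return dynamic_frame

# Usage in the job script:
# dyF = fail_on_stage_errors(dyF.map(foo, "foo"), "foo")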

Related

Can we have single metric for multiple GCP resources using advanced filter query in stackdriver logging?

The Python code I have written generates some global and some BigQuery-related logs in the Stackdriver Logging window. I am also trying to create a metric manually and then send some alerts. I wanted to know whether we can create a single metric for both the global and the BigQuery logs in Stackdriver.
In the advanced filter query I tried:
resource.type="bigquery_resource" AND resource.type="global"
severity=ERROR
but it gives the error: "Invalid request: Request contains an invalid argument"
Then I tried:
resource.type="bigquery_resource", "global" AND
severity=ERROR
and again it gives an error: "Invalid request: Unparseable filter: syntax error at line 1, column 33, token ','"
import logging
from google.cloud import logging as gcp_logging

client = gcp_logging.Client()
client.setup_logging()

try:
    if row_count_before.num_rows == row_count_after.num_rows:
        logging.error("Received empty query result")
    else:
        newly_added_rows = row_count_after.num_rows - row_count_before.num_rows
        logging.info("{} rows are found as query result".format(newly_added_rows))
except RuntimeError:
    logging.error("Exception occurred {}".format(client1.report_exception()))
I'm looking for an approach where I can have a single metric for multiple resource types. Thank you.
I think you want:
resource.type="bigquery_resource" OR resource.type="global"
severity=ERROR
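If the goal is then to hang a single logs-based metric off that combined filter, a sketch like the following with the google-cloud-logging client should be close. The metric name and description are made up, and the metric()/create() helpers are worth double-checking against the client-library version you have installed:
from google.cloud import logging as gcp_logging

client = gcp_logging.Client()

# OR the two resource types together and restrict to ERROR severity.
combined_filter = (
    '(resource.type="bigquery_resource" OR resource.type="global") '
    'AND severity=ERROR'
)

# Hypothetical metric name/description; metric()/create() come from the
# client library's logs-based-metric helpers.
metric = client.metric(
    "bq-and-global-errors",
    filter_=combined_filter,
    description="ERROR entries from BigQuery and global resources",
)
if not metric.exists():
    metric.create()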

PicklingError in Pyspark

I have written the function below in PySpark; it takes a deptid and returns a DataFrame that I want to use in Spark SQL.
def get_max_salary(deptid):
    sql_salary = "select max(salary) from empoyee where depid = {}"
    df_salary = spark.sql(sql_salary.format(deptid))
    return df_salary

spark.udf.register('get_max_salary', get_max_salary)
However, I get the error message below. I searched online but I couldn't find a proper solution anywhere. Could someone please help me here?
Error Message - PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
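The error itself points at the constraint: spark.sql needs the SparkSession/SparkContext, which only exists on the driver, so it cannot be called from inside a UDF that runs on the workers. Below is a minimal sketch of the usual driver-side workaround, reusing the table and column names from the question (which may need adjusting); it computes the per-department maximum with groupBy and a join instead of a UDF:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Compute the per-department maximum on the driver side (a normal DataFrame
# transformation), instead of calling spark.sql inside a UDF on the workers.
max_salary_by_dept = (
    spark.table("empoyee")          # table name taken from the question
         .groupBy("depid")
         .agg(F.max("salary").alias("max_salary"))
)

employees = spark.table("empoyee")
result = employees.join(max_salary_by_dept, on="depid", how="left")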

Why does it show the same log twice (once as info and once as error with the same message) on Stackdriver?

I am setting up some logging using Python, but whenever I run my code the logs show up twice in the Stackdriver console (once as info and once as error). Does anyone have an idea how to deal with this problem?
my code:
import logging
from google.cloud import logging as gcp_logging
log_client = gcp_logging.Client()
log_client.setup_logging()
# here executing some bigquery operations
logging.info("Query result loaded into temporary table: {}".format(temporary_table))
# here executing some bigquery operations
logging.error("Query executed with empty result set.")
When I run the above code it shows the logs twice on Stackdriver.
Info:2019-10-17T11:54:02.504Z cf-mycloudfunctionname Query result loaded into temporary table: mytable
Error:2019-10-17T11:54:02.505Z cf-mycloudfunctionname Query result loaded into temporary table: mytable
What I can see is that both entries (error and info) have been recognized as plain text, so the same message was sent to both stdout and stderr, which is why you are getting two identical messages.
What you need to do is phrase these two logs as structured JSON, so that Stackdriver recognizes each of them as a single entry with the correct payload.
Additionally, you can configure the Stackdriver agent to send the logs the way you need them to be sent; take a look at this document.
It will also depend on where you are retrieving these logs from (GCE, GKE, BigQuery). In some cases it is preferable to change the fluentd configuration directly instead of the Stackdriver agent.
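As a rough sketch of what "structured" can look like from Python, you could write the entries through the Cloud Logging client directly with an explicit severity, instead of relying on the stdlib root logger (the log name below is made up):
from google.cloud import logging as gcp_logging

client = gcp_logging.Client()
logger = client.logger("my-app")  # hypothetical log name

# log_struct/log_text attach an explicit severity, so Stackdriver does not
# have to guess it from plain stdout/stderr text.
logger.log_struct(
    {"message": "Query result loaded into temporary table: {}".format("mytable")},
    severity="INFO",
)
logger.log_text("Query executed with empty result set.", severity="ERROR")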

How can I print actual query generated by SQLAlchemy?

I'm trying to log all SQLAlchemy queries to the console while parsing the query and filling in the parameters (e.g. translating :param_1 to 123). I managed to find this answer on SO that does just that. The issue I'm running into is that parameters don't always get translated.
Here is the event I'm latching onto -
@event.listens_for(Engine, 'after_execute', named=True)
def after_cursor_execute(**kw):
    conn = kw['conn']
    params = kw['params']
    result = kw['result']
    stmt = kw['clauseelement']
    multiparams = kw['multiparams']
    print(literalquery(stmt))
Running this query will fail to translate my parameters. Instead, I'll see :param_1 in the output -
Model.query.get(123)
It yields a CompileError exception with message Bind parameter '%(38287064 param)s' without a renderable value not allowed here..
However, this query will translate :param_1 to 123 like I would expect -
db.session.query(Model).filter(Model.id == 123).first()
Is there any way to translate any and all queries that are run using SQLAlchemy?
FWIW I'm targeting SQL Server using the pyodbc driver.
If you set up the logging framework, you can get the SQL statements logged by setting the sqlalchemy.engine logger to INFO level, e.g.:
import logging
logging.basicConfig()
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)
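If you specifically need the bound values inlined (the original :param_1 to 123 goal), a sketch along these lines using SQLAlchemy's literal_binds compile option may get closer. Note that literal_binds raises for values the dialect cannot render literally, which is likely the same reason literalquery produced a CompileError, so the sketch falls back to the placeholder form:
from sqlalchemy import event
from sqlalchemy.engine import Engine

@event.listens_for(Engine, "after_execute", named=True)
def log_statement(**kw):
    stmt = kw["clauseelement"]
    try:
        # literal_binds asks the compiler to render bound parameter values
        # inline instead of as :param_1-style placeholders.
        compiled = stmt.compile(
            dialect=kw["conn"].dialect,
            compile_kwargs={"literal_binds": True},
        )
        print(str(compiled))
    except Exception:
        # Fall back to the placeholder form when a value cannot be inlined.
        print(str(stmt))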

AWS Lambda Function (Node) - Custom timeout logging

I'm wondering if there is any way of hijacking the standard "Task timed out after 1.00 seconds" log.
Bit of context: I'm streaming Lambda function logs into AWS Elasticsearch / Kibana, and one of the things I'm logging is whether or not the function executed successfully (good to know). I've set up a test stream to ES, and I've been able to define a pattern to map what I'm logging to fields in ES.
From the function, I console log something like:
"\"FAIL\"\"Something messed up\"\"0.100 seconds\""
and with the mapping, I get a log structure like:
Status - Message -------------------- Execution Time
FAIL ---- Something messed up --- 0.100 seconds
... Which is lovely. However if a log comes in like:
"Task timed out after 1.00 seconds"
then the mapping will obviously not apply. If it's picked up by ES it will likely dump the whole string into "Status", which is not ideal.
I thought perhaps I could query context.getRemainingTimeInMillis() and, if it gets within maybe 10 ms of the max execution time (which you can't get from the context object??), fire the custom log and ignore the default output. This, however, feels like a hack.
Does anyone have much experience with logging from AWS Lambda into ES? The key to creating these custom logs with status etc is so that we can monitor the activity of the lambda functions (many), and the default log formats don't allow us to classify the result of the function.
**** EDIT ****
The solution I went with was to modify the lambda function generated by AWS for sending log lines to Elasticsearch. It would be nice if I could interface with AWS's lambda logger to set the log format, however for now this will do!
I'll share a couple key points about this:
The work done for parsing the line and setting the custom fields is done in transform() before the call to buildSource().
The message itself (full log line) is found in logEvent.message.
You don't just reassign the message in the desired format (in fact leaving it be is probably best since the raw line is sent to ES). The key here is to set the custom fields in logEvent.extractedFields. So once I've ripped apart the log line, I set logEvent.extractedFields.status = "FAIL", logEvent.extractedFields.message = "Whoops.", etc etc.
