I have written the function below in PySpark to take a deptid and return a DataFrame, which I want to use in Spark SQL.
def get_max_salary(deptid):
    sql_salary = "select max(salary) from empoyee where depid = {}"
    df_salary = spark.sql(sql_salary.format(deptid))
    return df_salary

spark.udf.register('get_max_salary', get_max_salary)
However, I get the error message below. I searched online but couldn't find a proper solution anywhere. Could someone please help me here?
Error Message - PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
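For context, spark.sql needs the SparkSession/SparkContext, which exists only on the driver, so a Python UDF (which runs on the executors) cannot call it; that is what the pickling error points at. Below is a hedged sketch of one possible workaround that pre-computes the per-department maximum on the driver instead of registering a UDF (the table and column names are taken from the question and assumed to exist):

# One possible workaround (not the only one): compute the max salary per department
# once on the driver and expose it as a temp view that later Spark SQL can join against.
max_salary_df = spark.sql(
    "select depid, max(salary) as max_salary from empoyee group by depid"
)
max_salary_df.createOrReplaceTempView("dept_max_salary")

# Example usage: join on depid wherever the per-department maximum is needed, e.g.
# spark.sql("select e.*, m.max_salary from empoyee e join dept_max_salary m on e.depid = m.depid")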
I run a daily job to write data to BigQuery using Databricks PySpark. A recent configuration update for Databricks (https://docs.databricks.com/data/data-sources/google/bigquery.html) caused the job to fail. I followed all the steps in the docs. Reading data works again, but writing throws the following error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS not found
I also tried adding the configuration right in the code (as advised for similar errors in Spark), but it did not help:
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "<path-to-key.json>")
My code is:
upload_table_dataset = 'testing_dataset'
upload_table_name = 'testing_table'
upload_table = upload_table_dataset + '.' + upload_table_name

(import_df.write.format('bigquery')
    .mode('overwrite')
    .option('project', 'xxxxx-test-project')
    .option('parentProject', 'xxxxx-test-project')
    .option('temporaryGcsBucket', 'xxxxx-testing-bucket')
    .option('table', upload_table)
    .save()
)
You need to install the GCS connector on your cluster first; the missing class (com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS) is provided by that connector jar.
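Once the connector jar is installed, here is a hedged sketch of wiring up the Hadoop filesystem classes it provides (the keyfile path is a placeholder, and the exact properties can vary with connector version):

# Assumes the GCS connector jar is already installed on the cluster.
hconf = spark._jsc.hadoopConfiguration()
hconf.set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
# GoogleHadoopFS (the class from the error) is the AbstractFileSystem implementation.
hconf.set('fs.AbstractFileSystem.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS')
hconf.set('fs.gs.auth.service.account.enable', 'true')
hconf.set('google.cloud.auth.service.account.json.keyfile', '<path-to-key.json>')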
I am new to Azure and Spark and would like your help writing exception-handling code for the scenario below.
I have written HQL scripts (say hql1, hql2, hql3) in 3 different notebooks and call them all from one master notebook (hql-master), as:
val df_tab1 = runQueryForTable("hql1", spark)
val df_tab2 = runQueryForTable("hql2", spark)
Now I have the output of the HQL scripts stored as DataFrames, and I have to write exception handling in the master notebook: if the master notebook successfully executes all the DataFrames (df_tab1, df_tab2), a success status should get inserted into the Synapse table job_status.
If there is any error/exception during the execution of the master notebook/DataFrames, that error message should be captured and a failure status should get inserted into the Synapse table.
I already have the INSERT scripts for the success/failure message insert. It would be really helpful if you could provide a sample code snippet showing how the exception-handling part can be achieved. Thank you!!
Basically, it's just simple try/except code, something like this:
results = {}
were_errors = False

for script_name in ['script1', 'script2', 'script3']:
    try:
        # dbutils.notebook.run requires a timeout in seconds as its second argument
        # (3600 here is just an example value)
        retValue = dbutils.notebook.run(script_name, 3600)
        results[script_name] = retValue
    except Exception as e:
        results[script_name] = f"Error: {e}"
        were_errors = True

if were_errors:
    # log failure, e.g. run your failure INSERT; you can use data from the results variable
    ...
else:
    # log success, e.g. run your success INSERT
    ...
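If it helps, here is a minimal, hedged sketch of what the status insert itself could look like via a generic JDBC write; the connection options, the job_status table, and the column names are illustrative placeholders, and you would substitute your existing INSERT logic or whatever Synapse connector you already use:

from datetime import datetime

status = "FAILURE" if were_errors else "SUCCESS"
status_df = spark.createDataFrame(
    [(datetime.utcnow().isoformat(), status, str(results))],
    ["run_time", "status", "details"],
)
(status_df.write
    .format("jdbc")
    .option("url", "<synapse-jdbc-url>")   # placeholder
    .option("dbtable", "job_status")
    .option("user", "<user>")              # placeholder
    .option("password", "<password>")      # placeholder
    .mode("append")
    .save())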
Spark (2.4.5) is throwing the following error when trying to execute a select query similar to the one shown below.
org.apache.spark.sql.AnalysisException: Undefined function: 'coalesce'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 12
SELECT cast(coalesce(column1,'') as string) as id,cast(coalesce(column2,'2020-01-01') as date) as date1
from 4dea68ed921940e58f027e7146d495a4
Table 4dea68ed921940e58f027e7146d495a4 is a temp view created in Spark from a DataFrame.
This error happens only intermittently, after certain processes. Any help would be much appreciated.
The Spark job is submitted through Livy. The job takes two optional parameters and only one was provided. Providing all the parameters resolved the issue. I don't know why omitting an optional parameter caused this weird behavior, but it resolved the issue.
I'm using the map method of DynamicFrame (or, equivalently, the Map.apply method). I've noticed that any errors in the function that I pass to these methods are silently ignored and cause the returned DynamicFrame to be empty.
Say I have a job script like this:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *

glueContext = GlueContext(SparkContext.getOrCreate())
dyF = glueContext.create_dynamic_frame.from_catalog(database="radixdemo", table_name="census_csv")

def my_mapper(rec):
    import logging
    logging.error("[RADIX] An error-log from in the mapper!")
    print("[RADIX] from in the mapper!")
    raise Exception("[RADIX] A bug!")

dyF = dyF.map(my_mapper, 'my_mapper')

print("Count:", dyF.count())
dyF.printSchema()
dyF.toDF().show()
If I run this script in my Glue Dev Endpoint with gluepython, I get output like this:
[glue#ip-172-31-83-196 ~]$ gluepython gluejob.py
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/share/aws/glue/etl/jars/glue-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/05/23 20:56:46 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
Count: 0
root
++
||
++
++
Notes about this output:
I don't see the result of the print statement or the logging.error statement.
There's no indication that my_mapper raised an exception.
The printSchema call shows that there is no schema metadata on the produced DynamicFrame.
The show method also isn't producing any output, indicating that all the rows are gone.
Likewise, when I save this script as a job in the AWS Glue console, and run it, the job doesn't indicate any error occurred -- The Job Status is "Succeeded". Notably, I do get the print statements and logging.error calls output to the job logs, but only in the regular "Logs", not the "Error Logs".
What I want is to be able to indicate that my job has failed, and to be able to easily find these error logs. Most important is to just indicate that it has failed.
Is there a way to log an error within a mapped function in such a way that Glue will pick it up as an "Error Log" (and put it in that separate AWS CloudWatch Logs path)? If this happens, will it automatically mark the entire Job as Failing? Or is there some other way to explicitly fail the job from within a mapped function?
(my plan, if there is a way to log errors and/or mark the job as failed, is to create a decorator or other utility function that will automatically catch exceptions in my mapped functions and ensure that they are logged & marked as a failure).
The only way I have discovered to make a Glue job show up as "Failed" is to raise an exception from the main script (not inside a mapper or filter function, as those seem to get spun out to the Data Processing Units).
Fortunately, there is a way to detect if an exception occurred inside of a map or filter function: using the DynamicFrame.stageErrorsCount() method. It will return a number indicating how many exceptions were raised while running the most recent transformation.
So the correct way to solve all the problems:
make sure your map or transform function explicitly logs any exceptions that occur inside of it. This is best done by using a decorator function or via some other reusable mechanism, instead of relying on putting try/except statements in every single function you write.
after every single transformation that you want to catch errors in, call the stageErrorsCount() method and check if it's greater than 0. If you want to abort the job, just raise an exception.
For example:
import logging

def log_errors(inner):
    def wrapper(*args, **kwargs):
        try:
            return inner(*args, **kwargs)
        except Exception as e:
            logging.exception('Error in function: {}'.format(inner))
            raise
    return wrapper

@log_errors
def foo(record):
    1 / 0
Then, inside your job, you'd do something like:
df = df.map(foo, "foo")
if df.stageErrorsCount() > 0:
    raise Exception("Error in job! See the log!")
Note that, for some reason, even calling logging.exception from inside the mapper function still doesn't write the logs to the error log in AWS CloudWatch Logs; it gets written to the regular success logs. However, with this technique you will at least see that the job failed and be able to find the info in the logs. Another caveat: Dev Endpoints don't seem to show any logs from the mapper or filter functions.
I have written the following Spark code:
customDF.registerTempTable("customTable")
var query = "select date, id ts,errorode from customTable"
val finalCustomDF = hiveContext.sql(query)
finalCustomDF.write.format("com.databricks.spark.csv").save("/user/oozie/data")
When I run this code using spark-submit it runs fine, but when I run it through an Oozie coordinator I get the following exception.
User class threw exception: org.apache.spark.sql.AnalysisException: character '<EOF>' not supported here; line 1 pos 111
I have tried reading data from an existing Hive table and it works; the issue is only with customTable.