PicklingError when using foreachPartition in Pyspark

I am attempting to map a function (which updates records in a DynamoDB table if a certain condition is met) over a large dataframe in Pyspark. I'm aware that functions are pickled and sent to the executors, and I've read countless examples where the workaround is to define your map function at global scope. Unfortunately, that has not worked for me.
def update_dynamodb(rows, dynamodb_tb_name, s3_bucket_name, region):
    dynamodb_table = boto3.resource('dynamodb', region_name=region).Table(dynamodb_tb_name)
    s3_bucket = boto3.resource('s3', region_name=region).Bucket(s3_bucket_name)
    for row in rows:
        # code that modifies Dynamodb is here....

dynamodb_write_df = df.repartition(num_executors * 2)
dynamodb_write_df.rdd.foreachPartition(lambda x: update_dynamodb(x, dynamodb_tb_name, raw_s3_bucket, region))
This code produces the error:
_pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o81.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
on this line:
dynamodb_write_df.rdd.foreachPartition(lambda x: update_dynamodb(x, dynamodb_tb_name, raw_s3_bucket, region))
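For reference, a Py4JError about __getstate__ typically means the closure passed to foreachPartition is capturing something backed by a JVM object (a SparkSession, SparkContext, DataFrame, broadcast handle, or a value read from a JVM config object), which pickle cannot serialize. A minimal sketch of the usual workaround, assuming the table, bucket, and region are available as plain Python strings (the literal values below are illustrative, not taken from the question):
import boto3

def update_dynamodb(rows, dynamodb_tb_name, s3_bucket_name, region):
    # Create boto3 resources inside the function, once per partition,
    # so they never have to be pickled on the driver.
    dynamodb_table = boto3.resource('dynamodb', region_name=region).Table(dynamodb_tb_name)
    s3_bucket = boto3.resource('s3', region_name=region).Bucket(s3_bucket_name)
    for row in rows:
        pass  # conditional DynamoDB update goes here

# Capture only plain strings in the closure; anything obtained from a Spark/JVM
# object should be converted to a str on the driver first.
table_name = 'my-dynamodb-table'   # illustrative values
bucket_name = 'my-raw-bucket'
aws_region = 'us-east-1'

dynamodb_write_df.rdd.foreachPartition(
    lambda rows: update_dynamodb(rows, table_name, bucket_name, aws_region)
)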

Related

How to create make_batch_reader object of petastorm library in DataBricks?

I have data saved in parquet format. Petastorm is a library I am using to obtain batches of data for training.
I was able to do this on my local system, but the same code does not work in Databricks.
Code I used on my local system:
# create an iterator object train_reader. num_epochs is the number of epochs for which we want to train our model
with make_batch_reader('file:///config/workspace/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: (tf.convert_to_tensor(x))).batch(2)
    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)
Code I used in Databricks
with make_batch_reader('dbfs://output/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: (tf.convert_to_tensor(x))).batch(2)
    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)
The error I am getting on Databricks is:
TypeError: __init__() missing 2 required positional arguments: 'instance' and 'token'
I have checked the documentation but couldn't find any arguments that go by the names instance and token. However, in a similar petastorm method, make_reader, I see the following code for Azure Databricks:
# create sas token for storage account access, use your own adls account info
remote_url = "abfs://container_name@storage_account_url"
account_name = "<<adls account name>>"
linked_service_name = '<<linked service name>>'
TokenLibrary = spark._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
sas_token = TokenLibrary.getConnectionString(linked_service_name)
with make_reader('{}/data_directory'.format(remote_url), storage_options={'sas_token': sas_token}) as reader:
    for row in reader:
        print(row)
Here I see a 'sas_token' being passed as input.
Please suggest how I can resolve this error.
I tried changing the path of the parquet file, but that did not work for me.
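One thing worth trying, purely as a sketch and not confirmed by the question: petastorm's readers understand file://, hdfs:// and s3:// style URLs, and on Databricks the DBFS root is also exposed through the local FUSE mount at /dbfs, so the parquet file is often addressed as a file:// path over that mount rather than with dbfs://. The path below simply mirrors the file name from the question:
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset
import tensorflow as tf

# Assumed location: /dbfs/output/scaled.parquet on the DBFS FUSE mount.
# 'model' is the Keras model from the question.
with make_batch_reader('file:///dbfs/output/scaled.parquet',
                       num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch() \
        .map(lambda x: tf.convert_to_tensor(x)).batch(2)
    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)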

Not able to query AWS Glue/Athena views in Databricks Runtime ['java.lang.IllegalArgumentException: Can not create a Path from an empty string;']

Attempting to read a view that was created in AWS Athena (based on a Glue table that points to a parquet file in S3) using pyspark on a Databricks cluster throws the following error for an unknown reason:
java.lang.IllegalArgumentException: Can not create a Path from an empty string;
The first assumption was that access permissions were missing, but that wasn't the case.
While researching further, I found the following Databricks post about the reason for this issue: https://docs.databricks.com/data/metastores/aws-glue-metastore.html#accessing-tables-and-views-created-in-other-system
I was able to come up with a Python script to fix the problem. It turns out that this exception occurs because Athena and Presto store a view's metadata in a format that is different from what Databricks Runtime and Spark expect, so the views need to be re-created through Spark.
Example Python script, with an example invocation at the end:
import boto3
import time

def execute_blocking_athena_query(query: str, athenaOutputPath, aws_region):
    # Run an Athena query and block until it finishes (or fails).
    athena = boto3.client("athena", region_name=aws_region)
    res = athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={'OutputLocation': athenaOutputPath})
    execution_id = res["QueryExecutionId"]
    while True:
        res = athena.get_query_execution(QueryExecutionId=execution_id)
        state = res["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return
        if state in ["FAILED", "CANCELLED"]:
            raise Exception(res["QueryExecution"]["Status"]["StateChangeReason"])
        time.sleep(1)

def create_cross_platform_view(db: str, table: str, query: str, spark_session, athenaOutputPath, aws_region):
    # Re-create the view through Athena, then through Spark, and merge the two
    # metadata representations so both engines can read it.
    glue = boto3.client("glue", region_name=aws_region)
    glue.delete_table(DatabaseName=db, Name=table)
    create_view_sql = f"create view {db}.{table} as {query}"
    execute_blocking_athena_query(create_view_sql, athenaOutputPath, aws_region)
    presto_schema = glue.get_table(DatabaseName=db, Name=table)["Table"][
        "ViewOriginalText"
    ]
    glue.delete_table(DatabaseName=db, Name=table)

    spark_session.sql(create_view_sql).show()
    spark_view = glue.get_table(DatabaseName=db, Name=table)["Table"]
    for key in [
        "DatabaseName",
        "CreateTime",
        "UpdateTime",
        "CreatedBy",
        "IsRegisteredWithLakeFormation",
        "CatalogId",
    ]:
        if key in spark_view:
            del spark_view[key]
    spark_view["ViewOriginalText"] = presto_schema
    spark_view["Parameters"]["presto_view"] = "true"
    spark_view = glue.update_table(DatabaseName=db, TableInput=spark_view)

create_cross_platform_view("<YOUR DB NAME>", "<YOUR VIEW NAME>", "<YOUR VIEW SQL QUERY>", <SPARK_SESSION_OBJECT>, "<S3 BUCKET FOR OUTPUT>", "<YOUR-ATHENA-SERVICE-AWS-REGION>")
Again, note that this script keeps your views compatible with Glue/Athena.
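For completeness, a quick smoke test after running the script (the placeholders mirror the invocation above): the re-created view should now be readable from Databricks/Spark as well as from Athena.
spark.sql("select * from <YOUR DB NAME>.<YOUR VIEW NAME> limit 10").show()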
References:
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/29
https://docs.databricks.com/data/metastores/aws-glue-metastore.html#accessing-tables-and-views-created-in-other-system

Insert items in DynamoDB using lambdas without losing data

I am new to AWS and I am trying to load data into a DynamoDB table using Lambda functions and Python. The problem I have is the following: when I try to load a record into a table, the items that have the same partition key as the element I'm trying to insert are removed from the table. This is the code I'm using (I got it from the AWS documentation):
import boto3
from pprint import pprint

def put_car(car_id, car_type, message, dynamodb=None):
    if not dynamodb:
        dynamodb = boto3.resource('dynamodb', region_name='eu-west-1')
    table = dynamodb.Table('Cars')
    response = table.put_item(
        Item={
            'car_type': car_type,
            'car_id': car_id,
            'message': message,
        }
    )
    return response

def lambda_handler(event, context):
    car_resp = put_car("1", "Cartype1", "Car 1")
    print("Put car succeeded:")
    pprint(car_resp)
A possible solution would be to read all the records first and load them all again, including the record I want to insert, but this seems quite inefficient and I think there must be an easier way to do it.
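For context, and as an interpretation of the symptom rather than something stated in the question: put_item always replaces whatever item already has the same primary key, so if 'car_type' is the table's only key attribute, every put with a repeated car_type silently overwrites the earlier item. A hedged sketch of two common ways around that, reusing the Cars table from the code above:
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource('dynamodb', region_name='eu-west-1')
table = dynamodb.Table('Cars')

# Option 1: refuse to overwrite. The put raises ConditionalCheckFailedException
# instead of silently replacing an existing item with the same key.
table.put_item(
    Item={'car_type': 'Cartype1', 'car_id': '1', 'message': 'Car 1'},
    ConditionExpression=Attr('car_type').not_exists(),
)

# Option 2: create the table with a composite primary key (partition key
# 'car_type' plus sort key 'car_id'), so items that share a car_type no longer
# map to the same item and stop overwriting each other.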

In Spark how to call UDF with UDO as parameters to avoid binary error

I defined a UDF that takes a UDO (user-defined object) as a parameter. But when I try to call it on a dataframe, I get the error message "org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (array) => int)". I just want to know whether it is expected that the exception refers to the UDO as binary, and how I should fix it.
val logCount = (logs: util.List[LogRecord]) => logs.size()
val logCountUdf = udf(logCount)
// The column 'LogRecords' is the agg function collect_list of UDO LogRecord
df.withColumn("LogCount", logCountUdf($"LogRecords"))
In general you can't pass custom objects into UDFs, and you should only call the UDF for non-null rows, otherwise there will be a NullPointerException inside your UDF. Try:
val logCount = (logs: Seq[Row]) => logs.size()
val logCountUdf = udf(logCount)
df.withColumn("LogCount", when($"LogRecords".isNotNull,logCountUdf($"LogRecords")))
or just use the built-in function size to get the logCount:
df.withColumn("LogCount", size($"LogRecords"))

How to handle min_itemsize exception in writing to pandas HDFStore

I am using pandas HDFStore to store DataFrames which I have created from data.
store = pd.HDFStore(storeName, ...)
for file in downloaded_files:
    try:
        with gzip.open(file) as f:
            data = json.loads(f.read())
            df = json_normalize(data)
            store.append(storekey, df, format='table', append=True)
    except TypeError:
        # File Error
        pass
I have received the error:
ValueError: Trying to store a string with len [82] in [values_block_2] column but
this column has a limit of [72]!
Consider using min_itemsize to preset the sizes on these columns
I found that it is possible to set min_itemsize for the column involved, but this is not a viable solution because I do not know the maximum length I will encounter, nor all the columns for which the problem will occur.
Is there a way to automatically catch this exception and handle it each time it occurs?
I think you can do it this way:
store.append(storekey, df, format='table', append=True, min_itemsize={'Long_string_column': 200})
basically it's very similar to the following create table SQL statement:
create table df(
    id int,
    str varchar(200)
);
where 200 is the maximal allowed length for the str column
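If the maximum lengths are not known up front, one workaround, offered only as a sketch and not part of the original answer, is to size min_itemsize from the data before the first append and add generous headroom, since the column widths of a 'table'-format store are fixed when the table is created. The helper name and the headroom factor are illustrative assumptions:
import pandas as pd

def string_col_sizes(df: pd.DataFrame, headroom: int = 2) -> dict:
    """Estimate min_itemsize for every string (object-dtype) column."""
    return {
        col: max(1, int(df[col].astype(str).str.len().max()) * headroom)
        for col in df.select_dtypes(include='object').columns
    }

# The first append creates the table with the estimated widths; later appends
# that still exceed them raise the same ValueError, which can then be caught
# and handled (for example by logging and skipping the offending file).
store.append(storekey, df, format='table', append=True,
             min_itemsize=string_col_sizes(df))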
The following links might be very helpful:
https://www.google.com/search?q=pandas+ValueError%3A+Trying+to+store+a+string+with+len+in+column+but+min_itemsize&pws=0&gl=us&gws_rd=cr
HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there
Pandas pytable: how to specify min_itemsize of the elements of a MultiIndex
