Not able to query AWS Glue/Athena views in Databricks Runtime ['java.lang.IllegalArgumentException: Can not create a Path from an empty string;'] - apache-spark

Attempting to read a view that was created in AWS Athena (based on a Glue table that points to a Parquet file in S3) using PySpark on a Databricks cluster throws the following error for no obvious reason:
java.lang.IllegalArgumentException: Can not create a Path from an empty string;
My first assumption was that access permissions were missing, but that wasn't the case.
While researching further, I found the following Databricks documentation page about the cause of this issue: https://docs.databricks.com/data/metastores/aws-glue-metastore.html#accessing-tables-and-views-created-in-other-system

I was able to come up with a Python script to fix the problem. It turns out that this exception occurs because Athena and Presto store view metadata in a format that is different from what Databricks Runtime and Spark expect, so you need to re-create your views through Spark.
Python script with an execution example:
import time

import boto3


def execute_blocking_athena_query(query: str, athenaOutputPath, aws_region):
    athena = boto3.client("athena", region_name=aws_region)
    res = athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": athenaOutputPath},
    )
    execution_id = res["QueryExecutionId"]
    # Poll until the query finishes
    while True:
        res = athena.get_query_execution(QueryExecutionId=execution_id)
        state = res["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return
        if state in ["FAILED", "CANCELLED"]:
            raise Exception(res["QueryExecution"]["Status"]["StateChangeReason"])
        time.sleep(1)


def create_cross_platform_view(db: str, table: str, query: str, spark_session, athenaOutputPath, aws_region):
    glue = boto3.client("glue", region_name=aws_region)
    # Drop any existing view, then re-create it through Athena to capture the Presto metadata
    glue.delete_table(DatabaseName=db, Name=table)
    create_view_sql = f"create view {db}.{table} as {query}"
    execute_blocking_athena_query(create_view_sql, athenaOutputPath, aws_region)
    presto_schema = glue.get_table(DatabaseName=db, Name=table)["Table"]["ViewOriginalText"]
    glue.delete_table(DatabaseName=db, Name=table)

    # Re-create the view through Spark, then graft the Presto metadata back onto it
    spark_session.sql(create_view_sql).show()
    spark_view = glue.get_table(DatabaseName=db, Name=table)["Table"]
    # Remove the read-only fields that glue.update_table does not accept
    for key in [
        "DatabaseName",
        "CreateTime",
        "UpdateTime",
        "CreatedBy",
        "IsRegisteredWithLakeFormation",
        "CatalogId",
    ]:
        if key in spark_view:
            del spark_view[key]
    spark_view["ViewOriginalText"] = presto_schema
    spark_view["Parameters"]["presto_view"] = "true"
    spark_view = glue.update_table(DatabaseName=db, TableInput=spark_view)


create_cross_platform_view(
    "<YOUR DB NAME>",
    "<YOUR VIEW NAME>",
    "<YOUR VIEW SQL QUERY>",
    <SPARK_SESSION_OBJECT>,
    "<S3 BUCKET FOR OUTPUT>",
    "<YOUR-ATHENA-SERVICE-AWS-REGION>",
)
Again, note that this script keeps your views compatible with Glue/Athena.
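To sanity-check the result, you can query the re-created view from both engines. A minimal sketch that reuses the helper above (the database, view, and output-path values are the same placeholders as in the script, and spark is your SparkSession):
# Read the view from Spark on Databricks...
spark.sql("SELECT * FROM <YOUR DB NAME>.<YOUR VIEW NAME> LIMIT 5").show()
# ...and from Athena, using the helper defined above.
execute_blocking_athena_query(
    "SELECT * FROM <YOUR DB NAME>.<YOUR VIEW NAME> LIMIT 5",
    "<S3 BUCKET FOR OUTPUT>",
    "<YOUR-ATHENA-SERVICE-AWS-REGION>",
)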
References:
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/29
https://docs.databricks.com/data/metastores/aws-glue-metastore.html#accessing-tables-and-views-created-in-other-system

Related

How to create make_batch_reader object of petastorm library in DataBricks?

I have data saved in Parquet format. Petastorm is a library I am using to obtain batches of data for training.
I was able to do this on my local system, but the same code does not work in Databricks.
Code I used in my local system
# create an iterator object train_reader. num_epochs is the number of epochs for which we want to train our model
with make_batch_reader('file:///config/workspace/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: (tf.convert_to_tensor(x))).batch(2)
    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)
Code I used in Databricks
with make_batch_reader('dbfs://output/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: (tf.convert_to_tensor(x))).batch(2)
    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)
The error I am getting with the Databricks code is:
TypeError: __init__() missing 2 required positional arguments: 'instance' and 'token'
I have checked the documentation, but couldn't find any arguments that go by the names instance and token. However, in a similar petastorm method, make_reader, I see the below code for Azure Databricks:
# create sas token for storage account access, use your own adls account info
remote_url = "abfs://container_name@storage_account_url"
account_name = "<<adls account name>>"
linked_service_name = '<<linked service name>>'
TokenLibrary = spark._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
sas_token = TokenLibrary.getConnectionString(linked_service_name)

with make_reader('{}/data_directory'.format(remote_url), storage_options={'sas_token': sas_token}) as reader:
    for row in reader:
        print(row)
Here I see a 'sas_token' being passed as input.
Please suggest how I can resolve this error.
I tried changing the path of the parquet file, but that did not work out for me.
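For reference, one common adjustment on Databricks is to address DBFS through its local FUSE mount at /dbfs with a file:// URL instead of the dbfs:// scheme. A minimal sketch, assuming the file actually lives at /dbfs/output/scaled.parquet and model is the same Keras model as above:
import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# DBFS is FUSE-mounted at /dbfs on Databricks clusters, so a plain file:// path can be used.
with make_batch_reader('file:///dbfs/output/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: tf.convert_to_tensor(x)).batch(2)
    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)  # model is assumed to be defined as in the question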

Synapse Dedicated SQL Pool - Copy Into Failing With Odd error - Python

I'm getting an error when attempting to insert from a temp table into a table that exists in Synapse. Here is the relevant code:
def load_adls_data(self, schema: str, table: str, environment: str, filepath: str, columns: list) -> str:
    if self.exists_schema(schema):
        if self.exists_table(schema, table):
            if environment.lower() == 'prod':
                schema = "lvl0"
            else:
                schema = f"{environment.lower()}_lvl0"
            temp_table = self.generate_temp_create_table(schema, table, columns)
            sql0 = """
            IF OBJECT_ID('tempdb..#CopyDataFromADLS') IS NOT NULL
            BEGIN
                DROP TABLE #CopyDataFromADLS;
            END
            """
            sql1 = """
            {}
            COPY INTO #CopyDataFromADLS FROM
            '{}'
            WITH
            (
                FILE_TYPE = 'CSV',
                FIRSTROW = 1
            )
            INSERT INTO {}.{}
            SELECT *, GETDATE(), '{}' from #CopyDataFromADLS
            """.format(temp_table, filepath, schema, table, Path(filepath).name)
            print(sql1)

            conn = pyodbc.connect(self._synapse_cnx_str)
            conn.autocommit = True
            with conn.cursor() as db:
                db.execute(sql0)
                db.execute(sql1)
If I get rid of the insert statement and just do a select from the temp table in the script:
SELECT * FROM #CopyDataFromADLS
I get the same error in either case:
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Not able to validate external location because The remote server returned an error: (409) Conflict. (105215) (SQLExecDirectW)')
I've run the generated code for both the insert and the select in Synapse and they ran perfectly. Google has no real info on this, so could someone assist with this? Thanks.
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Not able to validate external location because The remote server returned an error: (409) Conflict. (105215) (SQLExecDirectW)')
This error occurs mostly because of authentication or access issues.
Make sure you have the Storage Blob Data Contributor role on the storage account.
In the COPY INTO script, add an authentication credential for the blob storage, unless it is a public storage account.
I tried to repro this using a COPY INTO statement without authentication and got the same error.
After adding authentication with a SAS key, the data was copied successfully.
Refer to the Microsoft documentation for the permissions required for bulk loading with COPY INTO statements.
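As an illustration, here is a minimal sketch of the COPY INTO statement with a SAS credential added, following the structure of the code in the question (the connection string, storage URL, and SAS token are placeholders, and the #CopyDataFromADLS temp table is assumed to have been created earlier in the same session):
import pyodbc

# Placeholders; replace with your own values.
synapse_cnx_str = "<your Synapse ODBC connection string>"
adls_file_url = "https://<storageaccount>.blob.core.windows.net/<container>/<path>/file.csv"
sas_token = "<your SAS token, without the leading '?'>"

copy_sql = f"""
COPY INTO #CopyDataFromADLS FROM
'{adls_file_url}'
WITH
(
    FILE_TYPE = 'CSV',
    FIRSTROW = 1,
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '{sas_token}')
)
"""

conn = pyodbc.connect(synapse_cnx_str)
conn.autocommit = True
with conn.cursor() as db:
    # Assumes #CopyDataFromADLS was created in this same session, as in the question.
    db.execute(copy_sql)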

Unable to remove Azure Synapse AutoML demand forecasting error: An invalid value for argument [y] was provided

I am trying to build a simple demand forecasting model using Azure AutoML in Synapse Notebook using Spark and SQL Context.
After aggregating the item quantity with respect to date and item id, this is what my data looks like in the event_file_processed.parquet file:
The date range is from 2020-08-13 to 2021-02-08.
I am following this documentation by MS: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-forecast
Here's how I have divided my train_data and test_data parquet files:
%%sql
CREATE OR REPLACE TEMPORARY VIEW train_data
AS SELECT
    *
FROM
    event_file_processed
WHERE
    the_date <= '2020-12-20'
ORDER BY
    the_date ASC

%%sql
CREATE OR REPLACE TEMPORARY VIEW test_data
AS SELECT
    *
FROM
    event_file_processed
WHERE
    the_date > '2020-12-20'
ORDER BY
    the_date ASC

%%pyspark
train_data = spark.sql("SELECT * FROM train_data")
train_data.write.parquet("train_data.parquet")
test_data = spark.sql("SELECT * FROM test_data")
test_data.write.parquet("test_data.parquet")
Below are my AutoML settings and run submission:
from azureml.automl.core.forecasting_parameters import ForecastingParameters

forecasting_parameters = ForecastingParameters(
    time_column_name='the_date',
    forecast_horizon=44,
    time_series_id_column_names=["items_id"],
    freq='W',
    target_lags='auto',
    target_aggregation_function='sum',
    target_rolling_window_size=3,
    short_series_handling_configuration='auto'
)

train_data = spark.read.parquet("train_data.parquet")
train_data.createOrReplaceTempView("train_data")
label = "total_item_qty"

from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
import logging

automl_config = AutoMLConfig(
    task='forecasting',
    primary_metric='normalized_root_mean_squared_error',
    experiment_timeout_minutes=15,
    enable_early_stopping=True,
    training_data=train_data,
    label_column_name=label,
    n_cross_validations=3,
    enable_ensembling=False,
    verbosity=logging.INFO,
    forecasting_parameters=forecasting_parameters
)

from azureml.core import Workspace, Datastore

# Enter your workspace subscription, resource group, name, and region.
subscription_id = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"  # you should be owner or contributor
resource_group = "XXXXXXXXXXX"  # you should be owner or contributor
workspace_name = "XXXXXXXXXXX"  # your workspace name

ws = Workspace(
    workspace_name=workspace_name,
    subscription_id=subscription_id,
    resource_group=resource_group
)

experiment = Experiment(ws, "AML-demand-forecasting-synapse")
local_run = experiment.submit(automl_config, show_output=True)
best_run, fitted_model = local_run.get_output()
I am stuck on the error below:
Error:
DataException: DataException:
Message: An invalid value for argument [y] was provided.
InnerException: InvalidValueException: InvalidValueException:
Message: Assertion Failed. Argument y is null. Target: y. Reference Code: b7440909-05a8-4220-b927-9fcb43fbf939
InnerException: None
ErrorResponse
I have checked that there are no null or rogue values in total_item_qty, and the types in the schema for the 3 variables are also correct.
If you can please give some suggestions, I'll be obliged.
Thanks,
Shantanu Jain
I am assuming you are not using the notebooks that the Synapse UI generates. If you use the wizard in Synapse, it will actually generate a PySpark notebook that you can run and tweak.
That experience is described here: https://learn.microsoft.com/en-us/azure/synapse-analytics/machine-learning/tutorial-automl
There are two issues:
Since you are running from Synapse, you are probably intending to run AutoML on Spark compute. In this case, you need to pass a spark context to the AutoMLConfig constructor: spark_context=sc
Second, you seem to pass a Spark DataFrame to AutoML as the training data. AutoML only supports AML Dataset (TabularDataset) input types in the Spark scenario right now. You can make a conversion like this:
from azureml.core import Datastore
from azureml.data.dataset_factory import TabularDatasetFactory

df = spark.sql("SELECT * FROM default.nyc_taxi_train")
datastore = Datastore.get_default(ws)
dataset = TabularDatasetFactory.register_spark_dataframe(df, datastore, name=experiment_name + "-dataset")

automl_config = AutoMLConfig(spark_context=sc, ....)
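Putting both changes together with the settings from the question, a rough sketch could look like the following (not a verified end-to-end run; ws, sc, label, forecasting_parameters, and the Spark session come from the question's code, and the dataset name is hypothetical):
from azureml.core import Datastore, Experiment
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.train.automl import AutoMLConfig

# Register the Spark DataFrame as a TabularDataset instead of passing it directly.
train_df = spark.sql("SELECT * FROM train_data")
datastore = Datastore.get_default(ws)
train_dataset = TabularDatasetFactory.register_spark_dataframe(
    train_df, datastore, name="demand-forecasting-train-dataset"  # hypothetical name
)

automl_config = AutoMLConfig(
    task='forecasting',
    spark_context=sc,  # Spark context of the Synapse session
    primary_metric='normalized_root_mean_squared_error',
    experiment_timeout_minutes=15,
    enable_early_stopping=True,
    training_data=train_dataset,  # TabularDataset, not a Spark DataFrame
    label_column_name=label,
    n_cross_validations=3,
    enable_ensembling=False,
    forecasting_parameters=forecasting_parameters
)

experiment = Experiment(ws, "AML-demand-forecasting-synapse")
run = experiment.submit(automl_config, show_output=True)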
Also curious to learn more about your use case and how you intend to use AutoML in Synapse. Please let me know if you would be interested to connect on that topic.
Thanks,
Nellie (from the Azure Synapse Team)

PicklingError when using foreachPartition in Pyspark

I am attempting to map a function (which updates records in a DynamoDB table if a certain condition is met) over a large DataFrame in PySpark. I'm aware that functions are pickled and sent to executors, but I've read countless examples where the workaround is to move the map function into the global scope. Unfortunately this has not worked for me.
import boto3

def update_dynamodb(rows, dynamodb_tb_name, s3_bucket_name, region):
    dynamodb_table = boto3.resource('dynamodb', region_name=region).Table(dynamodb_tb_name)
    s3_bucket = boto3.resource('s3', region_name=region).Bucket(s3_bucket_name)
    for row in rows:
        # code that modifies Dynamodb is here....
        ...

dynamodb_write_df = df.repartition(num_executors * 2)
dynamodb_write_df.rdd.foreachPartition(lambda x: update_dynamodb(x, dynamodb_tb_name, raw_s3_bucket, region))
This code produces the error:
_pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o81.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
on this line:
dynamodb_write_df.rdd.foreachPartition(lambda x: update_dynamodb(x, eviv_dynamodb_tb, raw_s3_bucket, region))
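For context, this error usually means the lambda's closure captures something backed by a Py4J/JVM object (for example the SparkSession, a DataFrame, or a class instance holding one), which cannot be pickled. Below is a minimal sketch of the pattern the question mentions (a module-level function with only plain Python values captured); the variable names mirror the question and are assumptions:
import boto3

def update_dynamodb(rows, dynamodb_tb_name, s3_bucket_name, region):
    # boto3 clients/resources are created inside the function, on the executor,
    # so they are never pickled.
    dynamodb_table = boto3.resource('dynamodb', region_name=region).Table(dynamodb_tb_name)
    s3_bucket = boto3.resource('s3', region_name=region).Bucket(s3_bucket_name)
    for row in rows:
        pass  # update logic goes here

# Resolve everything the closure needs into plain strings/ints *before* the call,
# so no SparkSession, SparkContext, or other Py4J-backed object is captured.
table_name = str(dynamodb_tb_name)
bucket_name = str(raw_s3_bucket)
aws_region = str(region)

dynamodb_write_df = df.repartition(num_executors * 2)
dynamodb_write_df.rdd.foreachPartition(
    lambda rows: update_dynamodb(rows, table_name, bucket_name, aws_region)
)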

How can I get throughput value for the databases present in the Cosmos DB and their state?

I want to get the details for the databases, or rather, the throughput value for all the databases present in Cosmos DB, and I also want to check whether the databases are active or not.
Is there any API for this? If not, how can I get the throughput and status of the databases?
I have been through the Cosmos DB documentation, but did not find any help.
I want to get the throughput value for all the databases present in Cosmos DB.
When you are using the Python SDK, you can get the request charge as follows.
The CosmosClient object from the Python SDK (see this Quickstart regarding its usage) exposes a last_response_headers dictionary that maps all the headers returned by the underlying HTTP API for the last operation executed. The request charge is available under the x-ms-request-charge key.
response = client.ReadItem('dbs/database/colls/container/docs/itemId', { 'partitionKey': 'partitionKey' })
request_charge = client.last_response_headers['x-ms-request-charge']
An example for the Python Azure SDK (>= 4.0.0):
from azure.cosmos import CosmosClient


class AzureHelper(object):
    def __init__(self, url, key):
        self.url = url
        self.key = key
        self._client = None
        self._keyspace = None

    def connect(self):
        self._client = CosmosClient(self.url, self.key)

    def get_list_keyspaces(self):
        result = []
        for db in self._client.list_databases():
            result.append(db['id'])
        return result

    def switch_keyspace(self, keyspace):
        self._keyspace = self._client.get_database_client(keyspace)

    def get_list_tables(self):
        result = []
        for table in self._keyspace.list_containers():
            result.append(table['id'])
        return result

    def get_throughput(self, table):
        container = self._keyspace.get_container_client(table)
        offer = container.read_offer()
        throughput = offer.properties['content']['offerThroughput']
        return throughput


if __name__ == '__main__':
    url = '<Cosmos DB URL, like https://testdb.cosmos.azure.com:443>'
    key = '<Primary Password>'
    az = AzureHelper(url, key)
    az.connect()
    for keyspace in az.get_list_keyspaces():
        az.switch_keyspace(keyspace)
        for table in az.get_list_tables():
            throughput = az.get_throughput(table)
            print(f'keyspace {keyspace}, table {table}, throughput {throughput}')
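Since the question asks about database-level throughput specifically, a rough sketch along the same lines is below (assuming azure-cosmos >= 4.0.0; only databases provisioned with shared throughput have an offer, so the call raises an error for the rest):
from azure.cosmos import CosmosClient
from azure.cosmos.exceptions import CosmosHttpResponseError

url = '<Cosmos DB URL, like https://testdb.cosmos.azure.com:443>'
key = '<Primary Password>'
client = CosmosClient(url, key)

for db in client.list_databases():
    database = client.get_database_client(db['id'])
    try:
        offer = database.read_offer()  # only present for databases with shared throughput
        throughput = offer.properties['content']['offerThroughput']
        print(f"database {db['id']}, shared throughput {throughput}")
    except CosmosHttpResponseError:
        print(f"database {db['id']} has no shared (database-level) throughput")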
