How to create make_batch_reader object of petastorm library in DataBricks? - keras

I have data saved in parquet format. Petastorm is a library I am using to obtain batches of data for training.
Though I was able to do this in my local system, but the same code is not working in Databricks.
Code I used in my local system
# create a iterator object train_reader. num_epochs is the number of epochs for which we want to train our model
with make_batch_reader('file:///config/workspace/scaled.parquet', num_epochs=4,shuffle_row_groups=False) as train_reader:
train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: (tf.convert_to_tensor(x))).batch(2)
for ele in train_ds:
tensor = tf.reshape(ele,(2,1,15))
model.fit(tensor,tensor)
Code I used in Databricks
with make_batch_reader('dbfs://output/scaled.parquet', num_epochs=4,shuffle_row_groups=False) as train_reader:
train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: (tf.convert_to_tensor(x))).batch(2)
for ele in train_ds:
tensor = tf.reshape(ele,(2,1,15))
model.fit(tensor,tensor)
Error I ma getting on DataBricks code is:
TypeError: init() missing 2 required positional arguments: 'instance' and 'token'
I have checked the documentation, but couldn't find any argument that Goes by the name of instance and token.However, in a similar method make_reader in petastorm, for Azure Databricks I see the below line of code:
# create sas token for storage account access, use your own adls account info
remote_url = "abfs://container_name#storage_account_url"
account_name = "<<adls account name>>"
linked_service_name = '<<linked service name>>'
TokenLibrary = spark._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
sas_token = TokenLibrary.getConnectionString(linked_service_name)
with make_reader('{}/data_directory'.format(remote_url), storage_options = {'sas_token' : sas_token}) as reader:
for row in reader:
print(row)
Here I see some 'sas_token' being passed as input.
Please suggest how do I resolve this error?
I tried changing paths of the parquet file but that did not work out for me.

Related

Not able to query AWS Glue/Athena views in Databricks Runtime ['java.lang.IllegalArgumentException: Can not create a Path from an empty string;']

Attempting to read a view which was created on AWS Athena (based on a Glue table that points to an S3's parquet file) using pyspark over a Databricks cluster throws the following error for an unknown reason:
java.lang.IllegalArgumentException: Can not create a Path from an empty string;
The first assumption was that access permissions are missing, but that wasn't the case.
When keep researching, I found the following Databricks' post about the reason for this issue: https://docs.databricks.com/data/metastores/aws-glue-metastore.html#accessing-tables-and-views-created-in-other-system
I was able to come up with a python script to fix the problem. It turns out that this exception occurs because Athena and Presto store view's metadata in a format that is different from what Databricks Runtime and Spark expect. You'll need to re-create your views through Spark
Python script example with execution example:
import boto3
import time
def execute_blocking_athena_query(query: str, athenaOutputPath, aws_region):
athena = boto3.client("athena", region_name=aws_region)
res = athena.start_query_execution(QueryString=query, ResultConfiguration={
'OutputLocation': athenaOutputPath})
execution_id = res["QueryExecutionId"]
while True:
res = athena.get_query_execution(QueryExecutionId=execution_id)
state = res["QueryExecution"]["Status"]["State"]
if state == "SUCCEEDED":
return
if state in ["FAILED", "CANCELLED"]:
raise Exception(res["QueryExecution"]["Status"]["StateChangeReason"])
time.sleep(1)
def create_cross_platform_view(db: str, table: str, query: str, spark_session, athenaOutputPath, aws_region):
glue = boto3.client("glue", region_name=aws_region)
glue.delete_table(DatabaseName=db, Name=table)
create_view_sql = f"create view {db}.{table} as {query}"
execute_blocking_athena_query(create_view_sql, athenaOutputPath, aws_region)
presto_schema = glue.get_table(DatabaseName=db, Name=table)["Table"][
"ViewOriginalText"
]
glue.delete_table(DatabaseName=db, Name=table)
spark_session.sql(create_view_sql).show()
spark_view = glue.get_table(DatabaseName=db, Name=table)["Table"]
for key in [
"DatabaseName",
"CreateTime",
"UpdateTime",
"CreatedBy",
"IsRegisteredWithLakeFormation",
"CatalogId",
]:
if key in spark_view:
del spark_view[key]
spark_view["ViewOriginalText"] = presto_schema
spark_view["Parameters"]["presto_view"] = "true"
spark_view = glue.update_table(DatabaseName=db, TableInput=spark_view)
create_cross_platform_view("<YOUR DB NAME>", "<YOUR VIEW NAME>", "<YOUR VIEW SQL QUERY>", <SPARK_SESSION_OBJECT>, "<S3 BUCKET FOR OUTPUT>", "<YOUR-ATHENA-SERVICE-AWS-REGION>")
Again, note that this script keeps your views compatible with Glue/Athena.
References:
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/29
https://docs.databricks.com/data/metastores/aws-glue-metastore.html#accessing-tables-and-views-created-in-other-system

Data Ingestion Using Big query's python client exceeds cloud functions maximum limit

I am trying to auto-ingest data from gcs into bigquery using a bucket triggered cloud function.the file types are gzipped json files which can have a maximum size of 2gb.the cloud function works fine with small files.however it tends to timeout when i give it large files that range from 1 to 2 gbs.is there a way to further optimize my function here is the code below:
def bigquery_job_trigger(data, context):
# Set up our GCS, and BigQuery clients
storage_client = storage.Client()
client = bigquery.Client()
file_data = data
file_name = file_data["name"]
table_id = 'BqJsonIngest'
bucket_name = file_data["bucket"]
dataset_id = 'dataDelivery'
dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.autodetect = True
blob = storage_client.bucket(bucket_name).get_blob(file_name)
file = blob.open("rb")
client.load_table_from_file(
file,
table_ref,
location="US", # Must match the destination dataset location.
job_config=job_config,
)
If the file's already in GCS, there's no need to open the blob inside your function (or the need to do so is not apparently from the snippet provided).
See client.load_table_from_uri, or just checkout one of the existing code samples like https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-gcs-csv#bigquery_load_table_gcs_csv-python

Autocreate tables in Bigquery for multiple CSV files

I want to generate tables automatically in Bigquery whenever a file is uploaded in storage bucket using cloud function in python.
For example- if sample1.csv file is uploaded to bucket then a sample1 table will be created in Bigquery.
How to automate it using cloud function using Python i tried with below code but was able to generate 1 table and all data got appended to that table, how to proceed
def hello_gcs(event, context):
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the table to create.
table_id = "test_project.test_dataset.test_Table"
job_config = bigquery.LoadJobConfig(
autodetect=True,
skip_leading_rows=1,
# The source format defaults to CSV, so the line below is optional.
source_format=bigquery.SourceFormat.CSV,
)
uri = "gs://test_bucket/*.csv"
load_job = client.load_table_from_uri(
uri, table_id, job_config=job_config
) # Make an API request.
load_job.result() # Waits for the job to complete.
destination_table = client.get_table(table_id) # Make an API request.
print("Processing file: {file['name']}.")
Sounds like you need to do three things:
Extract the name of the CSV file/object from the notification event you're receiving to fire your function.
Update the table_id in your example code to set the table name based on the filename you extracted in the first step.
Update the uri in your example code to only use the single file as the input. As written, your example attempts to load data from all matching CSV objects in GCS to the table.

Unable to remove Azure Synapse AutoML demand forecasting error: An invalid value for argument [y] was provided

I am trying to build a simple demand forecasting model using Azure AutoML in Synapse Notebook using Spark and SQL Context.
After aggregating the item quantity with respect to date and item id, this is what my data looks like this in the event_file_processed.parquet file:
The date range is from 2020-08-13 to 2021-02-08.
I am following this documentation by MS: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-forecast
Here's how I have divided my train_data and test_data parquet files:
%%sql
CREATE OR REPLACE TEMPORARY VIEW train_data
AS SELECT
*
FROM
event_file_processed
WHERE
the_date <= '2020-12-20'
ORDER BY
the_date ASC`
%%sql
CREATE OR REPLACE TEMPORARY VIEW test_data
AS SELECT
*
FROM
event_file_processed
WHERE
the_date > '2020-12-20'
ORDER BY
the_date ASC`
%%pyspark
train_data = spark.sql("SELECT * FROM train_data")
train_data.write.parquet("train_data.parquet")
test_data = spark.sql("SELECT * FROM test_data")
test_data.write.parquet("test_data.parquet")`
Below are my AutoML settings and run submission:
from azureml.automl.core.forecasting_parameters import ForecastingParameters
forecasting_parameters = ForecastingParameters(time_column_name='the_date',
forecast_horizon=44,
time_series_id_column_names=["items_id"],
freq='W',
target_lags='auto',
target_aggregation_function = 'sum',
target_rolling_window_size = 3,
short_series_handling_configuration = 'auto'
)
train_data = spark.read.parquet("train_data.parquet")
train_data.createOrReplaceTempView("train_data")
label = "total_item_qty"
from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
import logging
automl_config = AutoMLConfig(task='forecasting',
primary_metric='normalized_root_mean_squared_error',
experiment_timeout_minutes=15,
enable_early_stopping=True,
training_data=train_data,
label_column_name=label,
n_cross_validations=3,
enable_ensembling=False,
verbosity=logging.INFO,
forecasting_parameters = forecasting_parameters)
from azureml.core import Workspace, Datastore
# Enter your workspace subscription, resource group, name, and region.
subscription_id = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" #you should be owner or contributor
resource_group = "XXXXXXXXXXX" #you should be owner or contributor
workspace_name = "XXXXXXXXXXX" #your workspace name
ws = Workspace(workspace_name = workspace_name,
subscription_id = subscription_id,
resource_group = resource_group)
experiment = Experiment(ws, "AML-demand-forecasting-synapse")
local_run = experiment.submit(automl_config, show_output=True)
best_run, fitted_model = local_run.get_output()
I am badly stuck in the error below:
Error:
DataException: DataException:
Message: An invalid value for argument [y] was provided.
InnerException: InvalidValueException: InvalidValueException:
Message: Assertion Failed. Argument y is null. Target: y. Reference Code: b7440909-05a8-4220-b927-9fcb43fbf939
InnerException: None
ErrorResponse
I have checked that there are no null or rogue values in total_item_qty, the type in the schema for the 3 variables is also correct.
If you can please give some suggestions, I'll be obliged.
Thanks,
Shantanu Jain
Assuming you are not using the Notebooks that the Synapse UI generates. If you use the wizard in Synapse, it will actually generate a PySpark notebook that you can run and tweak.
That experience is described here: https://learn.microsoft.com/en-us/azure/synapse-analytics/machine-learning/tutorial-automl
The are two issues:
Since you are running from Synapse, you are probably intending to run AutoML on Spark compute. In this case, you need to pass a spark context to the AutoMLConfig constructor: spark_context=sc
Second, you seem to pass a Spark DataFrame to AutoML as the training data. AutoML only supports AML Dataset (TabularDataset) input types in the Spark scenario right now. You can make a conversion like this:
df = spark.sql("SELECT * FROM default.nyc_taxi_train")
datastore = Datastore.get_default(ws)
dataset = TabularDatasetFactory.register_spark_dataframe(df, datastore, name = experiment_name + "-dataset")
automl_config = AutoMLConfig(spark_context = sc,....)
Also curious to learn more about your use case and how you intend to use AutoML in Synapse. Please let me know if you would be interested to connect on that topic.
Thanks,
Nellie (from the Azure Synapse Team)

Make Python dictionary available to all spark partitions

I am trying to develop an algorithm in pyspark for which I am working with linalg.SparseVector class. I need to create a dictionary of key value pairs as input to each SparseVector object. Here the keys have to be integers as they represent integers (in my case representing user ids). I have a separate method that reads the input file and returns a dictionary where each user ID ( string) is mapped to an integer index. When I go through the file again and do a
FileRdd.map( lambda x: userid_idx[ x[0] ] ) . I receive a KeyError. I'm thinking this is because my dict is unavailable to all partitions. Is there a way to make userid_idx dict available to all partitions similar to a distributed map in MapReduce? Also I apologize for the mess. I am posting this using my phone. Will update in a while from my laptop.
The code as promised:
from pyspark.mllib.linalg import SparseVector
from pyspark import SparkContext
import glob
import sys
import time
"""We create user and item indices starting from 0 to #users and 0 to #items respectively. This is done to store them in sparseVectors as dicts."""
def create_indices(inputdir):
items=dict()
user_id_to_idx=dict()
user_idx_to_id=dict()
item_idx_to_id=dict()
item_id_to_idx=dict()
item_idx=0
user_idx=0
for inputfile in glob.glob(inputdir+"/*.txt"):
print inputfile
with open(inputfile) as f:
for line in f:
toks=line.strip().split("\t")
try:
user_id_to_idx[toks[1].strip()]
except KeyError:
user_id_to_idx[toks[1].strip()]=user_idx
user_idx_to_id[user_idx]=toks[1].strip()
user_idx+=1
try:
item_id_to_idx[toks[0].strip()]
except KeyError:
item_id_to_idx[toks[0].strip()]=item_idx
item_idx_to_id[item_idx]=toks[0].strip()
item_idx+=1
return user_idx_to_id,user_id_to_idx,item_idx_to_id,item_id_to_idx,user_idx,item_idx
# pass in the hdfs path to the input files and the spark context.
def runKNN(inputdir,sc,user_id_to_idx,item_id_to_idx):
rdd_text=sc.textFile(inputdir)
try:
new_rdd = rdd_text.map(lambda x: (item_id_to_idx[str(x.strip().split("\t")[0])],{user_id_to_idx[str(x.strip().split("\t")[1])]:1})).reduceByKey(lambda x,y: x.update(y))
except KeyError:
sys.exit(1)
new_rdd.saveAsTextFile("hdfs:path_to_output/user/hadoop/knn/output")
if __name__=="__main__":
sc = SparkContext()
u_idx_to_id,u_id_to_idx,i_idx_to_id,i_id_to_idx,u_idx,i_idx=create_indices(sys.argv[1])
u_idx_to_id_b=sc.broadcast(u_idx_to_id)
u_id_to_idx_b=sc.broadcast(u_id_to_idx)
i_idx_to_idx_b=sc.broadcast(i_idx_to_id)
i_id_to_idx_b=sc.broadcast(i_id_to_idx)
num_users=sc.broadcast(u_idx)
num_items=sc.broadcast(i_idx)
runKNN(sys.argv[1],sc,u_id_to_idx_b.value,i_id_to_idx_b.value)
In Spark, that dictionary will already be available to you as it is in all tasks. For example:
dictionary = {1:"red", 2:"blue"}
rdd = sc.parallelize([1,2])
rdd.map(lambda x: dictionary[x]).collect()
# Prints ['red', 'blue']
You will probably find that your issue is actually that your dictionary does not contain the key you are looking up!
From the Spark documentation:
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program.
A copy of local variables referenced will be sent to the node along with the task.
Broadcast variables will not help you here, they are simply a tool to improve performance by sending once per node rather than a once per task.

Resources