Loading data into Delta Lake from Azure Blob Storage - python-3.x

I am trying to load data into Delta Lake from Azure Blob Storage.
I am using the below code snippet:
storage_account_name = "xxxxxxxxdev"
storage_account_access_key = "xxxxxxxxxxxxxxxxxxxxx"
file_location = "wasbs://bicc-hdspk-eus-qc#xxxxxxxxdev.blob.core.windows.net/FSHC/DIM/FSHC_DIM_SBU"
file_type = "csv"
spark.conf.set("fs.azure.account.key."+storage_account_name+".blob.core.windows.net",storage_account_access_key)
df = spark.read.format(file_type).option("header","true").option("inferSchema", "true").option("delimiter", '|').load(file_location)
dx = df.write.format("parquet")
Up to this step it is working, and I am also able to load the data into a Databricks table.
dx.write.format("delta").save(file_location)
Error: AttributeError: 'DataFrameWriter' object has no attribute 'write'
P.S. Am I passing the wrong file location into the write statement? If that is the cause, what is the correct file path for Delta Lake?
Please let me know if additional information is needed.
Thanks,
Abhirup

dx is a DataFrameWriter, so what you're trying to do doesn't make sense. You could do this instead:
df = spark.read.format(file_type).option("header","true").option("inferSchema", "true").option("delimiter", '|').load(file_location)
df.write.format("parquet").save()
df.write.format("delta").save()

Related

How to create make_batch_reader object of petastorm library in DataBricks?

I have data saved in parquet format. Petastorm is a library I am using to obtain batches of data for training.
I was able to do this on my local system, but the same code is not working in Databricks.
Code I used on my local system:
# create an iterator object train_reader. num_epochs is the number of epochs for which we want to train our model
with make_batch_reader('file:///config/workspace/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: (tf.convert_to_tensor(x))).batch(2)
    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)
Code I used in Databricks:
with make_batch_reader('dbfs://output/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: (tf.convert_to_tensor(x))).batch(2)
    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)
The error I am getting with the Databricks code is:
TypeError: __init__() missing 2 required positional arguments: 'instance' and 'token'
I have checked the documentation but couldn't find any argument that goes by the name of instance or token. However, in a similar petastorm method, make_reader, I see the below code for Azure Databricks:
# create sas token for storage account access, use your own adls account info
remote_url = "abfs://container_name@storage_account_url"
account_name = "<<adls account name>>"
linked_service_name = '<<linked service name>>'
TokenLibrary = spark._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
sas_token = TokenLibrary.getConnectionString(linked_service_name)
with make_reader('{}/data_directory'.format(remote_url), storage_options={'sas_token': sas_token}) as reader:
    for row in reader:
        print(row)
Here I see a 'sas_token' being passed as input.
Please suggest how I can resolve this error.
I tried changing the paths of the parquet file, but that did not work for me.
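A hedged note, not part of the original post: the missing 'instance' and 'token' arguments suggest the dbfs:// URL is being handed to a Databricks filesystem handler that expects workspace credentials. Since DBFS is also mounted on the driver at /dbfs, one thing worth trying is pointing make_batch_reader at the local mount (the path below assumes the file actually lives at dbfs:/output/scaled.parquet):

# assumption: scaled.parquet lives at dbfs:/output/scaled.parquet; the /dbfs mount exposes it to local-file readers
with make_batch_reader('file:///dbfs/output/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: tf.convert_to_tensor(x)).batch(2)
    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)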

Unload data from SnowFlake into ADLS

We are trying to unload data from Snowflake (running on Azure) into ADLS.
The file gets created, but it is of type "octet-stream".
Below is an example of what I am running.
COPY INTO @PARQUET_STG/snowflake/data_share
FROM TABLE
file_format = 'PARQUET_FORMAT';
I even tried
file_format = (type 'parquet') and file_format = (type 'csv')
but got the same results.
Can someone point out what is missing here?

Data ingestion using BigQuery's Python client exceeds the Cloud Functions maximum limit

I am trying to auto-ingest data from GCS into BigQuery using a bucket-triggered Cloud Function. The files are gzipped JSON and can be up to 2 GB in size. The Cloud Function works fine with small files; however, it tends to time out when I give it large files in the 1 to 2 GB range. Is there a way to further optimize my function? Here is the code:
from google.cloud import bigquery, storage

def bigquery_job_trigger(data, context):
    # Set up our GCS and BigQuery clients
    storage_client = storage.Client()
    client = bigquery.Client()
    file_data = data
    file_name = file_data["name"]
    table_id = 'BqJsonIngest'
    bucket_name = file_data["bucket"]
    dataset_id = 'dataDelivery'
    dataset_ref = client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.autodetect = True
    blob = storage_client.bucket(bucket_name).get_blob(file_name)
    file = blob.open("rb")
    client.load_table_from_file(
        file,
        table_ref,
        location="US",  # Must match the destination dataset location.
        job_config=job_config,
    )
If the file is already in GCS, there's no need to open the blob inside your function (or at least the need to do so isn't apparent from the snippet provided).
See client.load_table_from_uri, or check out one of the existing code samples like https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-gcs-csv#bigquery_load_table_gcs_csv-python
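A hedged sketch of that approach, reusing the dataset and table names from the question; BigQuery pulls the gzipped JSON directly from GCS, so the function only submits the load job:

from google.cloud import bigquery

def bigquery_job_trigger(data, context):
    client = bigquery.Client()
    uri = "gs://{}/{}".format(data["bucket"], data["name"])
    table_ref = client.dataset("dataDelivery").table("BqJsonIngest")

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.autodetect = True

    # BigQuery reads the file from GCS itself; nothing is streamed through the function
    load_job = client.load_table_from_uri(uri, table_ref, location="US", job_config=job_config)
    # waiting for completion is optional and may still exceed the function timeout on very large loads
    load_job.result()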

Azure Python SDK data tables

I need help to get through this workflow.
I have 2 storage accounts, which I will call storage1 and storage2.
storage1 contains a list of tables with some data in them, and I would like to loop through all those tables and copy their content into storage2. I tried AzCopy, but I had no luck, as this feature is available only in AzCopy v7.3 and I couldn't find that version for macOS on an M1. The other option is Data Factory, but it is too complex for what I want to achieve. So I decided to go with the Azure Python SDK.
As a library I am using TableServiceClient from azure.data.tables.
The code I wrote looks like this:
from azure.data.tables import TableServiceClient

my_conn_str_out = 'storage1-Conn-Str'
table_service_client_out = TableServiceClient.from_connection_string(my_conn_str_out)

list_table = []
for table in table_service_client_out.list_tables():
    list_table.append(table.table_name)

my_conn_str_in = 'Storage2-Conn-str'
table_service_client_in = TableServiceClient.from_connection_string(my_conn_str_in)

for new_tables in table_service_client_out.list_tables():
    table_service_client_in.create_table_if_not_exists(new_tables.table_name)
    print(f'tables created successfully {new_tables.table_name}')
This is how I structured my code.
for table in table_service_client_out.list_tables():
    list_table.append(table.table_name)
I loop through all the tables in the storage account and append their names to a list.
Then:
for new_tables in table_service_client_out.list_tables():
    table_service_client_in.create_table_if_not_exists(new_tables.table_name)
    print(f'tables created successfully {new_tables.table_name}')
I create the same tables in storage2.
So far everything works just fine.
What I would like to achieve now is to query all the data in each table in storage1 and pass it to the respective table in storage2.
According to the Microsoft documentation, I can query the tables using this:
query = table_service_client_out.query_tables(filter=table)
so I integrated it into my loop like this:
for table in table_service_client_out.list_tables():
    query = table_service_client_out.query_tables(filter=table)
    list_table.append(table.table_name)
    print(query)
When I run my Python code, I get back the iterator objects (their memory addresses) rather than the data in the tables:
<iterator object azure.core.paging.ItemPaged at 0x7fcd90c8fbb0>
<iterator object azure.core.paging.ItemPaged at 0x7fcd90c8f7f0>
<iterator object azure.core.paging.ItemPaged at 0x7fcd90c8fd60>
I was wondering if there is a way to query all the data in my tables and pass it to storage2.
Try this:
from azure.cosmosdb.table.tableservice import TableService, ListGenerator

table_service_out = TableService(account_name='', account_key='')
table_service_in = TableService(account_name='', account_key='')

# query 100 items per request, to avoid consuming too much memory by loading all data at once
query_size = 100

# save data to storage2 and check whether there is data left in the current table; if yes, recurse
def queryAndSaveAllDataBySize(tb_name, resp_data: ListGenerator, table_out: TableService, table_in: TableService, query_size: int):
    for item in resp_data:
        # remove etag and Timestamp appended by the table service
        del item.etag
        del item.Timestamp
        print("insert data:" + str(item) + " into table:" + tb_name)
        table_in.insert_entity(tb_name, item)
    if resp_data.next_marker:
        data = table_out.query_entities(table_name=tb_name, num_results=query_size, marker=resp_data.next_marker)
        queryAndSaveAllDataBySize(tb_name, data, table_out, table_in, query_size)

tbs_out = table_service_out.list_tables()

for tb in tbs_out:
    # create a table with the same name in storage2
    table_service_in.create_table(tb.name)
    # first query
    data = table_service_out.query_entities(tb.name, num_results=query_size)
    queryAndSaveAllDataBySize(tb.name, data, table_service_out, table_service_in, query_size)
Of course, this is a simple demo for your requirement. For more efficiency, you can also query table data by partition key and commit it in batches.
Let me know if you have any more questions.
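For reference, a similar copy loop can be sketched with the newer azure.data.tables package used in the question; attribute and method names below should be checked against your installed SDK version:

from azure.data.tables import TableServiceClient

# connection strings are placeholders
svc_out = TableServiceClient.from_connection_string('storage1-Conn-Str')
svc_in = TableServiceClient.from_connection_string('Storage2-Conn-str')

for table in svc_out.list_tables():
    name = table.name  # may be table.table_name on older SDK versions
    svc_in.create_table_if_not_exists(name)
    src = svc_out.get_table_client(name)
    dst = svc_in.get_table_client(name)
    # list_entities() pages through every row; upsert keeps the copy idempotent on re-runs
    for entity in src.list_entities():
        dst.upsert_entity(entity)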

Unable to resolve Azure Synapse AutoML demand forecasting error: An invalid value for argument [y] was provided

I am trying to build a simple demand forecasting model using Azure AutoML in a Synapse notebook, using Spark and the SQL context.
After aggregating the item quantity with respect to date and item id, this is what my data looks like in the event_file_processed.parquet file:
The date range is from 2020-08-13 to 2021-02-08.
I am following this documentation by MS: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-forecast
Here's how I have divided my train_data and test_data parquet files:
%%sql
CREATE OR REPLACE TEMPORARY VIEW train_data
AS SELECT
*
FROM
event_file_processed
WHERE
the_date <= '2020-12-20'
ORDER BY
the_date ASC
%%sql
CREATE OR REPLACE TEMPORARY VIEW test_data
AS SELECT
*
FROM
event_file_processed
WHERE
the_date > '2020-12-20'
ORDER BY
the_date ASC
%%pyspark
train_data = spark.sql("SELECT * FROM train_data")
train_data.write.parquet("train_data.parquet")
test_data = spark.sql("SELECT * FROM test_data")
test_data.write.parquet("test_data.parquet")
Below are my AutoML settings and run submission:
from azureml.automl.core.forecasting_parameters import ForecastingParameters

forecasting_parameters = ForecastingParameters(
    time_column_name='the_date',
    forecast_horizon=44,
    time_series_id_column_names=["items_id"],
    freq='W',
    target_lags='auto',
    target_aggregation_function='sum',
    target_rolling_window_size=3,
    short_series_handling_configuration='auto'
)
train_data = spark.read.parquet("train_data.parquet")
train_data.createOrReplaceTempView("train_data")
label = "total_item_qty"
from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
import logging
automl_config = AutoMLConfig(task='forecasting',
                             primary_metric='normalized_root_mean_squared_error',
                             experiment_timeout_minutes=15,
                             enable_early_stopping=True,
                             training_data=train_data,
                             label_column_name=label,
                             n_cross_validations=3,
                             enable_ensembling=False,
                             verbosity=logging.INFO,
                             forecasting_parameters=forecasting_parameters)
from azureml.core import Workspace, Datastore

# Enter your workspace subscription, resource group, name, and region.
subscription_id = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"  # you should be owner or contributor
resource_group = "XXXXXXXXXXX"  # you should be owner or contributor
workspace_name = "XXXXXXXXXXX"  # your workspace name

ws = Workspace(workspace_name=workspace_name,
               subscription_id=subscription_id,
               resource_group=resource_group)
experiment = Experiment(ws, "AML-demand-forecasting-synapse")
local_run = experiment.submit(automl_config, show_output=True)
best_run, fitted_model = local_run.get_output()
I am badly stuck on the error below:
Error:
DataException: DataException:
Message: An invalid value for argument [y] was provided.
InnerException: InvalidValueException: InvalidValueException:
Message: Assertion Failed. Argument y is null. Target: y. Reference Code: b7440909-05a8-4220-b927-9fcb43fbf939
InnerException: None
ErrorResponse
I have checked that there are no null or rogue values in total_item_qty, and the types in the schema for the three variables are also correct.
If you can please give some suggestions, I would be obliged.
Thanks,
Shantanu Jain
I am assuming you are not using the notebooks that the Synapse UI generates. If you use the wizard in Synapse, it will actually generate a PySpark notebook that you can run and tweak.
That experience is described here: https://learn.microsoft.com/en-us/azure/synapse-analytics/machine-learning/tutorial-automl
There are two issues:
First, since you are running from Synapse, you are probably intending to run AutoML on Spark compute. In this case, you need to pass a Spark context to the AutoMLConfig constructor: spark_context=sc.
Second, you seem to pass a Spark DataFrame to AutoML as the training data. AutoML only supports AML Dataset (TabularDataset) input types in the Spark scenario right now. You can make a conversion like this:
df = spark.sql("SELECT * FROM default.nyc_taxi_train")
datastore = Datastore.get_default(ws)
dataset = TabularDatasetFactory.register_spark_dataframe(df, datastore, name = experiment_name + "-dataset")
automl_config = AutoMLConfig(spark_context = sc,....)
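Putting the two fixes together, a hedged sketch of how the question's configuration might look; the registered dataset name is a placeholder, and sc, ws, label, and forecasting_parameters are assumed to be defined as in the question:

from azureml.core import Datastore
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.train.automl import AutoMLConfig

# convert the Spark DataFrame into a registered TabularDataset
train_df = spark.sql("SELECT * FROM train_data")
datastore = Datastore.get_default(ws)
train_dataset = TabularDatasetFactory.register_spark_dataframe(
    train_df, datastore, name="demand-forecasting-train-data")

automl_config = AutoMLConfig(task='forecasting',
                             primary_metric='normalized_root_mean_squared_error',
                             experiment_timeout_minutes=15,
                             enable_early_stopping=True,
                             training_data=train_dataset,  # TabularDataset instead of a Spark DataFrame
                             label_column_name=label,
                             n_cross_validations=3,
                             enable_ensembling=False,
                             spark_context=sc,  # run AutoML on the Synapse Spark pool
                             forecasting_parameters=forecasting_parameters)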
I am also curious to learn more about your use case and how you intend to use AutoML in Synapse. Please let me know if you would be interested in connecting on that topic.
Thanks,
Nellie (from the Azure Synapse Team)
