Azure Databricks - Cannot export results from Databricks to blob

I want to export my data from Databricks to Azure blob. My Databricks commands select some PDFs from my blob, run Form Recognizer, and export the output results to my blob.
Here is my code:
%pip install azure.storage.blob
%pip install azure.ai.formrecognizer

from azure.storage.blob import ContainerClient

container_url = "https://mystorageaccount.blob.core.windows.net/pdf-raw"
container = ContainerClient.from_container_url(container_url)

for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    print(blob_url)
import requests
from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential
endpoint = "https://myendpoint.cognitiveservices.azure.com/"
key = "mykeynumber"
form_recognizer_client = FormRecognizerClient(endpoint, credential=AzureKeyCredential(key))
import pandas as pd
field_list = ["InvoiceDate","InvoiceID","Items","VendorName"]
df = pd.DataFrame(columns=field_list)
for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
    invoices = poller.result()
    print("Scanning " + blob.name + "...")
    for idx, invoice in enumerate(invoices):
        single_df = pd.DataFrame(columns=field_list)
        for field in field_list:
            entry = invoice.fields.get(field)
            if entry:
                single_df[field] = [entry.value]
        single_df['FileName'] = blob.name
        df = df.append(single_df)

df = df.reset_index(drop=True)
df
account_name = "mystorageaccount"
account_key = "fs.azure.account.key." + account_name + ".blob.core.windows.net"
try:
    dbutils.fs.mount(
        source = "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net",
        mount_point = "/mnt/pdf-recognized",
        extra_configs = {account_key: dbutils.secrets.get(scope="formrec", key="formreckey")})
except:
    print('Directory already mounted or error')

df.to_csv(r"/dbfs/mnt/pdf-recognized/output.csv", index=False)
The code runs fine until the very last line. I get the following error message:
Directory already mounted or error. FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/mnt/pdf-recognized/output.csv'.
I tried using /dbfs:/ instead of /dbfs/ but I don't know what I am doing wrong.
How can I export my Databricks results to the blob?
Thank you

It looks like you're trying to mount storage that is already mounted. Really, the mount operation should be done only once, not dynamically. You have several choices to implement it correctly:
unmount before mounting, using dbutils.fs.unmount("/mnt/pdf-recognized")
check if the storage is already mounted and only run the mount if it isn't. Something like this (not tested):
mounts = [mount for mount in dbutils.fs.mounts()
          if mount.mountPoint == "/mnt/pdf-recognized"]
if len(mounts) == 0:
    dbutils.fs.mount(....)
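A fuller sketch of that check, reusing the container, mount point, and secret scope from the question; this is untested, and the config key name is an assumption based on the account_key variable defined earlier:
mount_point = "/mnt/pdf-recognized"
already_mounted = any(m.mountPoint == mount_point for m in dbutils.fs.mounts())

if not already_mounted:
    dbutils.fs.mount(
        source = "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net",
        mount_point = mount_point,
        extra_configs = {
            # same key name the question builds as account_key
            "fs.azure.account.key.mystorageaccount.blob.core.windows.net":
                dbutils.secrets.get(scope="formrec", key="formreckey")
        })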
you don't really need a mount - it has the "bad" property that anyone in the workspace can use it with the permissions that were used for mounting. It could be simpler to write the results to local disk and then copy the file to the necessary location using dbutils.fs.cp with the wasbs protocol. Something like this:
df.to_csv(r"/tmp/my-output.csv", index=False)
spark.conf.set(account_key, dbutils.secrets.get(scope ="formrec", key="formreckey"))
dbutils.fs.cp("file:///tmp/my-output.csv"),
"wasbs://pdf-recognized#mystorageaccount.blob.core.windows.net/output.csv")

Related

How to unzip and load a tsv file into BigQuery from a GCS bucket

Below is the code to get the tsv.gz file from GCS, unzip it, and convert it into a comma-separated csv file to load into BigQuery.
from google.cloud import storage
import gcsfs
import gzip
import pandas as pd

storage_client = storage.Client(project=project_id)
blobs_list = list(storage_client.list_blobs(bucket_name))

for blobs in blobs_list:
    if blobs.name.endswith(".tsv.gz"):
        source_file = blobs.name
        uri = "gs://{}/{}".format(bucket_name, source_file)
        gcs_file_system = gcsfs.GCSFileSystem(project=project_id)
        with gcs_file_system.open(uri) as f:
            gzf = gzip.GzipFile(mode="rb", fileobj=f)
            csv_table = pd.read_table(gzf)
            csv_table.to_csv('GfG.csv', index=False)
This code does not seem effective for loading the data into BQ, as I am getting many issues. I think I am doing something wrong with the file conversion. Please share your thoughts on where it went wrong.
If your file is gzip (not zip, I mean gzip) and it is in Cloud Storage, don't download it, unzip it, and stream-load it.
You can load it directly, as is, into BigQuery - it's magic!! Here is a sample:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    autodetect=True,  # Automatic schema
    field_delimiter=",",  # Use \t if your separator is tab in your TSV file
    skip_leading_rows=1,  # Skip the header values (but keep it for the column naming)
    # The source format defaults to CSV, so the line below is optional.
    source_format=bigquery.SourceFormat.CSV,
)

uri = "gs://{}/{}".format(bucket_name, source_file)

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.
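Since the source files in the question are .tsv.gz, the field_delimiter would be "\t" rather than ",", as the comment in the sample notes. As a small follow-up check, reusing the client and table_id from the sample above (this verification step is not part of the original answer):
# Verify the load by reading back the table metadata (reuses client and table_id from above).
destination_table = client.get_table(table_id)  # API request for the table's metadata
print("Loaded {} rows into {}.".format(destination_table.num_rows, table_id))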

Unable to mount Azure ADLS Gen 2 from Community Edition of Databricks : com.databricks.rpc.UnknownRemoteException: Remote exception occurred

I am trying to mount ADLS Gen 2 from my Databricks Community Edition, but when I run the following code:
test = spark.read.csv("/mnt/lake/RAW/csds.csv", inferSchema=True, header=True)
I get the error:
com.databricks.rpc.UnknownRemoteException: Remote exception occurred:
I'm using the following code to mount ADLS Gen 2
def check(mntPoint):
    a = []
    for test in dbutils.fs.mounts():
        a.append(test.mountPoint)
    result = a.count(mntPoint)
    return result

mount = "/mnt/lake"

if check(mount) == 1:
    resultMsg = "<div>%s is already mounted. </div>" % mount
else:
    dbutils.fs.mount(
        source = "wasbs://root@adlspretbiukadlsdev.blob.core.windows.net",
        mount_point = mount,
        extra_configs = {"fs.azure.account.key.adlspretbiukadlsdev.blob.core.windows.net": ""})
    resultMsg = "<div>%s was mounted. </div>" % mount

displayHTML(resultMsg)
ServicePrincipalID = 'xxxxxxxxxxx'
ServicePrincipalKey = 'xxxxxxxxxxxxxx'
DirectoryID = 'xxxxxxxxxxxxxxx'
Lake = 'adlsgen2'

# Combine DirectoryID into full string
Directory = "https://login.microsoftonline.com/{}/oauth2/token".format(DirectoryID)

# Create configurations for our connection
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": ServicePrincipalID,
           "fs.azure.account.oauth2.client.secret": ServicePrincipalKey,
           "fs.azure.account.oauth2.client.endpoint": Directory}

mount = "/mnt/lake"

if check(mount) == 1:
    resultMsg = "<div>%s is already mounted. </div>" % mount
else:
    dbutils.fs.mount(
        source = f"abfss://root@{Lake}.dfs.core.windows.net/",
        mount_point = mount,
        extra_configs = configs)
    resultMsg = "<div>%s was mounted. </div>" % mount
I then try to read a dataframe in ADLS Gen 2 using the following:
dataPath = "/mnt/lake/RAW/DummyEventData/CommerceTools/"
test = spark.read.csv("/mnt/lake/RAW/csds.csv", inferSchema=True, header=True)
and I get the same error:
com.databricks.rpc.UnknownRemoteException: Remote exception occurred:
Any ideas?
Based on the stack trace, the most probable reason for that error is that you don't have the Storage Blob Data Contributor (or Storage Blob Data Reader) role assigned to your service principal (as described in the documentation). This role is different from the usual "Contributor" role, and that's very confusing.
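Once the role is assigned, one way to confirm the service principal works without relying on a mount is to set the per-account OAuth properties on the Spark session and read via abfss directly. This is only a sketch that reuses the placeholder variables from the question (ServicePrincipalID, ServicePrincipalKey, Directory, the adlsgen2 account, and the root container):
# Per-account OAuth settings for direct abfss access (no mount); account name 'adlsgen2'
# and container 'root' are the placeholders used in the question.
spark.conf.set("fs.azure.account.auth.type.adlsgen2.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.adlsgen2.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.adlsgen2.dfs.core.windows.net", ServicePrincipalID)
spark.conf.set("fs.azure.account.oauth2.client.secret.adlsgen2.dfs.core.windows.net", ServicePrincipalKey)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.adlsgen2.dfs.core.windows.net", Directory)

# If the role assignment is in place, this read should succeed without the mount.
test = spark.read.csv("abfss://root@adlsgen2.dfs.core.windows.net/RAW/csds.csv",
                      inferSchema=True, header=True)
test.show(5)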

Calling a REST API in Snowflake

How do I call a REST API from a notebook in Snowflake? The API call generates output files which need to be stored in Snowflake itself.
You do not have to call the API directly from Snowflake. You can load the files from your Python notebook with a connection to the Snowflake DB.
SQL code:
-- Create a destination table to store your API query results
create or replace table public.tmp_table
(page_id int, id int, status varchar, provider_status varchar, ts_created timestamp);

-- Create a new file format for your csv files
create or replace file format my_new_format type = 'csv' field_delimiter = ';' field_optionally_enclosed_by = '"' skip_header = 1;

-- Put your local file into Snowflake's temporary (user) stage
put file:///Users/Admin/Downloads/your_csv_file_name.csv @~/staged;

-- Copy data from the stage into the table
copy into public.tmp_table from @~/staged/your_csv_file_name.csv.gz file_format = my_new_format ON_ERROR=CONTINUE;

select * from public.tmp_table;

-- Delete the temporary data
remove @~/staged/your_csv_file_name.csv.gz;
You can do the same with Python:
https://docs.snowflake.com/en/user-guide/python-connector-example.html#loading-data
import snowflake.connector

target_table = 'public.tmp_table'
filename = 'your_csv_file_name'
filepath = f'/home/Users/Admin/Downloads/{filename}.csv'

conn = snowflake.connector.connect(
    user=USER,
    password=PASSWORD,
    account=ACCOUNT,
    warehouse=WAREHOUSE,
    database=DATABASE,
    schema=SCHEMA
)

conn.cursor().execute(f'put file://{filepath} @~/staged;')

result = conn.cursor().execute(f'''
    COPY INTO {target_table}
    FROM @~/staged/{filename}.csv.gz
    file_format = (format_name = 'my_new_format'
                   FIELD_OPTIONALLY_ENCLOSED_BY = '"'
                   ESCAPE_UNENCLOSED_FIELD = NONE) ON_ERROR=CONTINUE;
''')

conn.cursor().execute(f'REMOVE @~/staged/{filename}.csv.gz;')
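For the REST API part of the question, the call can happen in the same notebook before the PUT. A minimal sketch, assuming a hypothetical endpoint that returns a list of flat JSON records; the URL, timeout, and field layout are placeholders, the CSV is written with the ';' delimiter to match my_new_format, and filepath is reused from the block above:
import csv
import requests

# Call the API (placeholder URL) and fail fast on HTTP errors.
response = requests.get("https://api.example.com/v1/results", timeout=60)
response.raise_for_status()
rows = response.json()  # assumed: a non-empty list of flat JSON records

# Write the records to the local CSV that the PUT command stages into Snowflake.
with open(filepath, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys(), delimiter=";")
    writer.writeheader()
    writer.writerows(rows)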

How to read a csv file using pandas and Cloud Functions in GCP?

I am just trying to read a csv file that was uploaded to GCS.
I want to read a csv file uploaded to GCS with Cloud Functions in GCP, and I want to handle the csv data as a DataFrame.
But I can't read the csv file using pandas.
This is the code to read the csv file on GCS using Cloud Functions:
from io import BytesIO

import pandas as pd
from google.cloud import storage as gcs

def read_csvfile(data, context):
    try:
        bucket_name = "my_bucket_name"
        file_name = "my_csvfile_name.csv"
        project_name = "my_project_name"

        # create gcs client
        client = gcs.Client(project_name)
        bucket = client.get_bucket(bucket_name)

        # create blob and download its contents
        blob = gcs.Blob(file_name, bucket)
        content = blob.download_as_string()
        train = pd.read_csv(BytesIO(content))
        print(train.head())
    except Exception as e:
        print("error:{}".format(e))
When I ran my Python code, I got the following error:
No columns to parse from file
Some websites say this error means I read an empty csv file, but I actually uploaded a non-empty csv file.
So how can I solve this problem?
Please give me your help. Thanks.
---- added 2020/08/08 ----
Thank you for giving me your help!
But in the end I could not read the csv file using your code... I still get the error, No columns to parse from file.
So I tried a new way to read the csv file, as Byte type. The new Python code to read the csv file is below.
main.py
from google.cloud import storage
import pandas as pd
import io
import csv
from io import BytesIO

def check_columns(data, context):
    try:
        object_name = data['name']
        bucket_name = data['bucket']

        storage_client = storage.Client()
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(object_name)
        data = blob.download_as_string()

        # read the uploaded csv file as Byte type.
        f = io.StringIO(str(data))
        df = pd.read_csv(f, encoding="shift-jis")

        print("df:{}".format(df))
        print("df.columns:{}".format(df.columns))
        print("The number of columns:{}".format(len(df.columns)))
    except Exception as e:
        print("error:{}".format(e))
requirements.txt
Click==7.0
Flask==1.0.2
itsdangerous==1.1.0
Jinja2==2.10
MarkupSafe==1.1.0
Pillow==5.4.1
qrcode==6.1
six==1.12.0
Werkzeug==0.14.1
google-cloud-storage==1.30.0
gcsfs==0.6.2
pandas==1.1.0
The output I got is below.
df:Empty DataFrame
Columns: [b'Apple, Lemon, Orange, Grape]
Index: []
df.columns:Index(['b'Apple', 'Lemon', 'Orange', 'Grape'])
The number of columns:4
So I could only read the first record in the csv file, as df.columns!? I could not get the other records in the csv file... and the first row is not column names but a normal record.
So how can I get the records in the csv file as a DataFrame using pandas?
Could you help me again? Thank you.
Pandas, since version 0.24.1, can directly read a Google Cloud Storage URI.
For example:
gs://awesomefakebucket/my.csv
Your service account attached to your function must have access to read the CSV file.
Please, feel free to test and modify this code.
I used Python 3.7
function.py
from google.cloud import storage
import pandas as pd

def hello_world(request):
    # it is mandatory to initialize the storage client
    client = storage.Client()
    # please change the file's URI
    temp = pd.read_csv('gs://awesomefakebucket/my.csv', encoding='utf-8')
    print(temp.head())
    return f'check the results in the logs'
requirements.txt
google-cloud-storage==1.30.0
gcsfs==0.6.2
pandas==1.1.0
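On the b'Apple symptom in the follow-up: calling str() on the downloaded bytes turns the payload into the literal string "b'...'", which is why the first cell keeps the b' prefix and the rows collapse. If the in-memory approach is preferred over the gs:// URI, a hedged variant is to hand the raw bytes to pandas and let it decode:
import io
import pandas as pd
from google.cloud import storage

def check_columns(data, context):
    # Variant of the asker's function: pass the raw bytes to pandas instead of str(bytes).
    storage_client = storage.Client()
    blob = storage_client.bucket(data['bucket']).blob(data['name'])
    raw = blob.download_as_string()  # bytes, not str
    df = pd.read_csv(io.BytesIO(raw), encoding="shift-jis")
    print(df.head())
    print("The number of columns:{}".format(len(df.columns)))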

Loading data into Delta Lake from Azure blob storage

I am trying to load data into Delta Lake from Azure blob storage.
I am using the below code snippet:
storage_account_name = "xxxxxxxxdev"
storage_account_access_key = "xxxxxxxxxxxxxxxxxxxxx"

file_location = "wasbs://bicc-hdspk-eus-qc@xxxxxxxxdev.blob.core.windows.net/FSHC/DIM/FSHC_DIM_SBU"
file_type = "csv"

spark.conf.set("fs.azure.account.key." + storage_account_name + ".blob.core.windows.net", storage_account_access_key)

df = spark.read.format(file_type).option("header", "true").option("inferSchema", "true").option("delimiter", '|').load(file_location)

dx = df.write.format("parquet")
Up to this step it is working, and I am also able to load it into a Databricks table.
dx.write.format("delta").save(file_location)
error: AttributeError: 'DataFrameWriter' object has no attribute 'write'
P.S. - Am I passing the file location into the write statement incorrectly? If this is the cause, then what is the file path for the Delta lake?
Please let me know in case additional information is needed.
Thanks,
Abhirup
dx is a DataFrameWriter, so what you're trying to do doesn't make sense. You could do this:
df = spark.read.format(file_type).option("header","true").option("inferSchema", "true").option("delimiter", '|').load(file_location)
df.write.format("parquet").save()
df.write.format("delta").save()
