Handling spaces in an abfss path using COPY INTO with Azure Databricks

I am trying to use the COPY INTO statement in Databricks to ingest CSV files from Cloud Storage.
The problem is that the folder name contains a space (/AP Posted/), and when I refer to that path, the code raises the error below:
Error in SQL statement: URISyntaxException: Illegal character in path at index 70: abfss://gpdi-files#hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/AP Posted/
I googled the error and found articles advising replacing the space with "%20", but that solution did not work.
So, does anyone know how to solve this? Or is the only solution indeed to avoid spaces when naming folders?
This is my current Databricks SQL code:
COPY INTO prod_gbs_gpdi.bronze_data.my_table
FROM 'abfss://gpdi-files#hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/AP Posted/'
FILEFORMAT = CSV
VALIDATE 500 ROWS
PATTERN = 'AP_SAPEX_KPI_001 - Posted Invoices in 2021_3.CSV'
FORMAT_OPTIONS(
'header'='true',
'delimiter'=';',
'skipRows'='8',
'mergeSchema'='true', --Whether to infer the schema across multiple files and to merge the schema of each file
'encoding'='UTF-8',
'enforceSchema'='true', --Whether to forcibly apply the specified or inferred schema to the CSV files
'ignoreLeadingWhiteSpace'='true',
'ignoreTrailingWhiteSpace'='true',
'mode'='PERMISSIVE' --Parser mode around handling malformed records
)
COPY_OPTIONS (
'force' = 'true', --If set to true, idempotency is disabled and files are loaded regardless of whether they’ve been loaded before.
'mergeSchema'= 'true' --If set to true, the schema can be evolved according to the incoming data.
)

Trying to use a path where one of the folders has a space gave the same error.
To overcome this, you can specify the folder in the PATTERN parameter as follows:
%sql
COPY INTO table1
FROM '/mnt/repro/op/'
FILEFORMAT = csv
PATTERN='has space/sample1.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');
Alternatively, giving the path as path/has?space/ also works, since ? matches any single character in the path. But if there are multiple folders like has space, hasAspace, hasBspace, etc., this would not work as expected.
%sql
COPY INTO table2
FROM '/mnt/repro/op/has?space/'
FILEFORMAT = csv
PATTERN='sample1.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');
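Applied to an ABFSS path like the one in the question, the same idea is to keep the FROM path free of spaces and push the space-containing folder into PATTERN. A minimal sketch run from Python, with placeholder container, storage-account, and table names:
# Sketch only: <container>, <storage-account>, and the table name are placeholders.
# The folder with the space lives in PATTERN, where it is treated as a glob
# rather than as part of the URI that gets validated.
spark.sql("""
  COPY INTO my_catalog.my_schema.my_table
  FROM 'abfss://<container>@<storage-account>.dfs.core.windows.net/RAW/'
  FILEFORMAT = CSV
  PATTERN = 'AP Posted/*.CSV'
  FORMAT_OPTIONS ('header' = 'true', 'delimiter' = ';')
  COPY_OPTIONS ('mergeSchema' = 'true')
""")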
Another alternative is to copy the file to DBFS using dbutils.fs.cp() and then use the DBFS path in COPY INTO.
dbutils.fs.cp('/mnt/repro/op/has space/sample1.csv','/FileStore/tables/mycsv.csv')
%sql
COPY INTO table3
FROM '/FileStore/tables/'
FILEFORMAT = csv
PATTERN='mycsv.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');
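If the folder contains many files, the same staging idea can be applied to the whole directory in one go; a small sketch using the example paths from above (recurse=True copies the folder contents):
# Copy the entire space-containing folder to a staging path without spaces,
# then point COPY INTO at the staging path as shown above.
dbutils.fs.cp('/mnt/repro/op/has space/', '/FileStore/tables/staging/', recurse=True)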

Related

Databricks SQL API: Load csv file without header

I am trying to load a CSV file (without a header) into a Delta table following the "Load the sample data from cloud storage into the table" guideline, but I cannot find any instructions on how to define the source file schema/header.
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = <format>
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
Based on the delta-copy-into and FORMAT_OPTIONS docs, I assume enforceSchema would be the right option, but how do I provide the schema definition using the SQL API?
If you don't have a header in the files, Spark will assign column names automatically, like _c0, _c1, etc., and put them into the table. If you want to give them meaningful names, you need to use a slightly different syntax: the SELECT form gives you the ability to rename columns and do type casting if necessary. Like this (the cast is just an example):
COPY INTO my_table FROM (
SELECT _c0 as col1, cast(_c1 as date) as date, _c2 as col3, ...
FROM '/path/to/files'
)
FILEFORMAT = <format>
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
P.S. I'm not sure that inferSchema is good to use here, as you may need to do the casts anyway.
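Along those lines, a hedged, self-contained variant of the snippet above that drops inferSchema and relies on explicit casts (the table, path, and column names here are hypothetical, not from the question):
# Sketch only: names are made up. 'header' = 'false' tells the CSV reader there
# is no header row, and the SELECT list does the renaming and casting explicitly,
# so inferSchema is not needed.
spark.sql("""
  COPY INTO my_catalog.my_schema.my_table FROM (
    SELECT _c0 AS id,
           CAST(_c1 AS DATE) AS event_date,
           _c2 AS description
    FROM 's3://my-bucket/landing/'
  )
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'false')
  COPY_OPTIONS ('mergeSchema' = 'true')
""")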

Write DDL to .sql file using Pandas

I am trying to extract the DDL of tables and store it in .sql files using pandas
The code I have tried is :
query = "show table tablename"
df = pd.read_sql(query, connect)
df.to_csv('xyz.sql', index=False, header=False, quoting=None)
This creates a .sql file with the DDL like this -
" CREATE TABLE .....
.... ; "
How do I write the file without the quotes, like -
CREATE TABLE .....
.... ;
Given a string s, such as "CREATE ...", one can delete double-quote characters with:
s = s.replace('"', '')
And don't forget maketrans, which (with translate) is very good at efficiently deleting unwanted characters from very long strings.
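Putting that together, a minimal sketch of writing the DDL without surrounding quotes, assuming the query returns a single row and column containing the full CREATE TABLE statement ('connect' is the connection object from the question):
import pandas as pd

# Read the DDL as in the question; take the raw string out of the DataFrame.
df = pd.read_sql("show table tablename", connect)
ddl = df.iloc[0, 0]

# Strip stray double quotes as suggested above; str.maketrans/translate is an
# efficient alternative for very long strings.
ddl = ddl.replace('"', '')
# ddl = ddl.translate(str.maketrans('', '', '"'))

# Writing the file directly avoids to_csv's quoting behaviour altogether.
with open('xyz.sql', 'w') as f:
    f.write(ddl)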

Bring new data from a CSV file to a Delta table

I have created a new table from a CSV file with the following code:
%sql
SET spark.databricks.delta.schema.autoMerge.enabled = true;
create table if not exists catlog.schema.tablename;
COPY INTO catlog.schema.tablename
FROM (SELECT * FROM 's3://bucket/test.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')
But now I have a new file with additional data. How can I load that new data file into the Delta table? Please guide me, thanks.
I tried to reproduce the same in my environment and got the results below.
Make sure the schema and the CSV file's data types match, otherwise you will get an error.
Use the syntax below to insert data from the CSV file:
%sql
copy into <catalog>.<schema>.<table_name>
from "<file_loaction>/file_3.csv"
FILEFORMAT = csv
FORMAT_OPTIONS('header'='true','inferSchema'='True');
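Because COPY INTO is idempotent and keeps track of the files it has already loaded into the target table, re-running it against the whole folder should pick up only the new file; setting 'force' = 'true' in COPY_OPTIONS would reload everything. A sketch with hypothetical names:
# Sketch only: catalog/schema/table and path are placeholders. Re-running this
# after new files land in the folder loads just the files not yet ingested.
spark.sql("""
  COPY INTO my_catalog.my_schema.my_table
  FROM 's3://my-bucket/landing/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true', 'mergeSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true')
""")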

Delta Live Tables and ingesting AVRO

So, I'm trying to load Avro files into DLT, create pipelines, and so forth.
As a simple DataFrame in Databricks, I can read and unpack the Avro files using spark.read.json / rdd.map / lambda functions, create a temp view, run a SQL query, and then select the fields I want.
# example command
in_path = '/mnt/file_location/*/*/*/*/*.avro'
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)
data.createOrReplaceTempView("eventhub")
# selecting the data
sql_query1 = sqlContext.sql("""
select distinct
data.field.test1 as col1
,data.field.test2 as col2
,data.field.fieldgrp.city as city
from
eventhub
""")
However, I am trying to replicate the process using Delta Live Tables and pipelines.
I have used Auto Loader to load the files into a table and kept the format as is, so bronze is just Avro in its rawest form.
I then planned to create a view that exposes the unpacked Avro data, much like I did above with "eventhub", which would then allow me to create queries.
The trouble is, I can't get it to work in DLT. I fail at the second step, after I have imported the files into the bronze layer. It just does not seem to apply the functions to make the data readable/selectable.
This is the sort of code I have been trying. However, it does not seem to pick up the schema, so it is as if the functions are not working; when I try to select a column, it does not recognise it.
# unpacked data
@dlt.view(name=f"eventdata_v")
def eventdata_v():
    avroDf = spark.read.format("delta").table("live.bronze_file_list")
    jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
    data = spark.read.json(jsonRdd)
    return data

# trying to query the data, but it does not recognise field names, even when I select "data" only
@dlt.view(name=f"eventdata2_v")
def eventdata2_v():
    df = (
        dlt.read("eventdata_v")
        .select("data.field.test1")
    )
    return df
I have been working on this for weeks, trying different approaches, but still no luck.
Any help would be much appreciated. Thank you.
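A minimal sketch of one possible workaround, assuming the Avro Body column holds JSON and that its schema can be declared up front; the struct below is only a guess based on the fields referenced earlier, and the dataset name assumes bronze_file_list is defined in the same pipeline:
import dlt
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema for the JSON payload; adjust to the real structure.
body_schema = StructType([
    StructField("field", StructType([
        StructField("test1", StringType()),
        StructField("test2", StringType()),
    ])),
])

@dlt.view(name="eventdata_v")
def eventdata_v():
    # from_json avoids the RDD round-trip, which DLT pipelines do not handle well.
    return (
        dlt.read("bronze_file_list")
        .select(from_json(col("Body").cast("string"), body_schema).alias("data"))
    )

@dlt.view(name="eventdata2_v")
def eventdata2_v():
    return dlt.read("eventdata_v").select(col("data.field.test1").alias("col1"))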

Read parquet files from S3 folder using wildcard

I have S3 folders as below, each with parquet files:
s3://bucket/folder1/folder2/2020-02-26-12/key=Boston_20200226/
s3://bucket/folder1/folder2/2020-02-26-12/key=Springfield_20200223/
s3://bucket/folder1/folder2/2020-02-26-12/key=Toledo_20200226/
s3://bucket/folder1/folder2/2020-02-26-12/key=Philadelphia_20191203/
My goal is to be able to open the parquet files from '*_20200226' folders only.
I use a for loop to first gather a list of all the files and then pass it to the read operation into a DataFrame in Spark 2.4.
s3_files = []
PREFIX = "folder1/folder2/"
min_datetime = current_datetime - timedelta(hours=72)
while current_datetime >= min_datetime:
    each_hour_prefix = min_datetime.strftime('%Y-%m-%d-%H')
    if any(fname.key.endswith('.parquet') for fname in s3_bucket.objects.filter(Prefix=(PREFIX + each_hour_prefix))):
        s3_files.append('s3://{bucket}/{prefix}'.format(bucket=INPUT_BUCKET_NAME, prefix=(PREFIX + each_hour_prefix + '/*')))
    min_datetime = min_datetime + timedelta(hours=1)
spark.read.option('basePath', ('s3://' + INPUT_BUCKET_NAME)).schema(fileSchema).parquet(*s3_files)
where fileSchema is the schema struct of the parquet files and s3_files is an array of all the files I picked up by walking through the S3 folders above. The above for loop works, but my goal is to read only the Boston_20200226 and Toledo_20200226 folders. Is it possible to do wildcard searches like "folder1/folder2/2020-02-26-12/key=**_20200226*", or perhaps change the 'read.parquet' command in some way?
Thanks in advance.
Update:
I resorted to a rudimentary way of walking through all the folders and only finding files that match pattern = '20200226' (not the most efficient way). I collect the keys in a list, then read each parquet file into a DataFrame and perform a union at the end. Everything works fine except that the 'key' column is not read in the final DataFrame; it is part of the partitionBy() code that created these parquet files. Any idea on how I can capture the 'key'?
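A minimal sketch of one way this might work, assuming Spark's path globbing and partition discovery behave as usual: a wildcard inside the key= directory selects only the matching city folders, and keeping basePath one level above lets Spark treat 'key' as a partition column so it shows up in the DataFrame (fileSchema is assumed to be defined as above):
# Sketch only: the glob selects key=Boston_20200226 and key=Toledo_20200226;
# basePath makes 'key' appear as a partition column in the result.
df = (
    spark.read
    .option("basePath", "s3://bucket/folder1/folder2/2020-02-26-12/")
    .schema(fileSchema)
    .parquet("s3://bucket/folder1/folder2/2020-02-26-12/key=*_20200226/")
)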
