Bring new data from a CSV file to a Delta table - Databricks

I have created a new table from a CSV file with the following code:
%sql
SET spark.databricks.delta.schema.autoMerge.enabled = true;
create table if not exists catalog.schema.tablename;
COPY INTO catalog.schema.tablename
FROM (SELECT * FROM 's3://bucket/test.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')
but now I have a new file with additional data. How can I load that? Please guide me, thanks.

I need to load the new data file into the Delta table.

I tried to reproduce the same in my environment and got the results below.
Make sure the table schema and the CSV file's data types match, otherwise you will get an error.
Please follow the syntax below to insert data from a CSV file:
%sql
copy into <catalog>.<schema>.<table_name>
from "<file_loaction>/file_3.csv"
FILEFORMAT = csv
FORMAT_OPTIONS('header'='true','inferSchema'='true');

Related

Databricks SQL API: Load csv file without header

I am trying to load a CSV file (without a header) into a Delta table using the "Load the sample data from cloud storage into the table" guideline, but I cannot find any instructions on how to define the source file schema/header.
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = <format>
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
Based on the delta-copy-into and FORMAT_OPTIONS docs, I assume that enforceSchema would be the right option, but how do I provide the schema definition using the SQL API?
If you don't have a header in the files, then Spark will assign column names automatically, like _c0, _c1, etc., and put them into the table. If you want to give them meaningful names, then you need to use a slightly different syntax with a SELECT clause, which gives you the ability to rename columns and do type casting if necessary. Like this (the cast is just an example):
COPY INTO my_table FROM (
SELECT _c0 as col1, cast(_c1 as date) as date, _c2 as col3, ...
FROM '/path/to/files'
)
FILEFORMAT = <format>
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
P.S. I'm not sure that inferSchema is a good fit here, as you may need to do the casts anyway.

Write DDL to .sql file using Pandas

I am trying to extract the DDL of tables and store it in .sql files using pandas.
The code I have tried is:
query = "show table tablename"
df = pd.read_sql(query, connect)
df.to_csv('xyz.sql', index=False, header=False, quoting=None)
This creates a .sql file with the DDL like this -
" CREATE TABLE .....
.... ; "
How do I write the file without the quotes, like -
CREATE TABLE .....
.... ;
Given a string s, such as "CREATE ...",
one can delete double-quote characters with:
s = s.replace('"', '')
And don't forget maketrans, which (with translate) is very good at efficiently deleting unwanted characters from very long strings.
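Putting both suggestions together with the question's own code, a minimal sketch might look like this (it assumes the DDL comes back as a single text column, and reuses the question's connect object and xyz.sql file name):
import pandas as pd

# `connect` is the existing DB connection from the question; the query is unchanged.
query = "show table tablename"
df = pd.read_sql(query, connect)

# Join whatever rows come back into a single DDL string (assumes one text column).
ddl = "\n".join(df.iloc[:, 0].astype(str))

# translate/maketrans strips the double quotes; same result as ddl.replace('"', ''),
# but it stays fast on very long strings.
ddl = ddl.translate(str.maketrans("", "", '"'))

# Write the cleaned DDL directly instead of going through to_csv,
# which applies its own CSV quoting rules.
with open("xyz.sql", "w") as f:
    f.write(ddl)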

Handling spaces in the abfss path using COPY INTO with Azure Databricks

I am trying to use the COPY INTO statement in Databricks to ingest CSV files from Cloud Storage.
The problem is that the folder name has a space in it (/AP Posted/), and when I try to refer to the path, the code execution raises the error below:
Error in SQL statement: URISyntaxException: Illegal character in path at index 70: abfss://gpdi-files#hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/AP Posted/
I googled the error and found articles advising to replace the space with "%20", but that solution did not work.
So, does someone know how to solve it? Or is the only solution indeed to avoid spaces when naming folders?
This is my current Databricks SQL Code:
COPY INTO prod_gbs_gpdi.bronze_data.my_table
FROM 'abfss://gpdi-files#hgbsprodgbsflastorage01.dfs.core.windows.net/RAW/AP Posted/'
FILEFORMAT = CSV
VALIDATE 500 ROWS
PATTERN = 'AP_SAPEX_KPI_001 - Posted Invoices in 2021_3.CSV'
FORMAT_OPTIONS(
'header'='true',
'delimiter'=';',
'skipRows'='8',
'mergeSchema'='true', --Whether to infer the schema across multiple files and to merge the schema of each file
'encoding'='UTF-8',
'enforceSchema'='true', --Whether to forcibly apply the specified or inferred schema to the CSV files
'ignoreLeadingWhiteSpace'='true',
'ignoreTrailingWhiteSpace'='true',
'mode'='PERMISSIVE' --Parser mode around handling malformed records
)
COPY_OPTIONS (
'force' = 'true', --If set to true, idempotency is disabled and files are loaded regardless of whether they’ve been loaded before.
'mergeSchema'= 'true' --If set to true, the schema can be evolved according to the incoming data.
)
Trying to use a path where one of the folders has a space gave the same error.
To overcome this, you can specify the folder in the PATTERN parameter as follows:
%sql
COPY INTO table1
FROM '/mnt/repro/op/'
FILEFORMAT = csv
PATTERN='has space/sample1.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');
Or, giving the path as path/has?space/ also works. But if there are multiple folders like has space, hasAspace, hasBspace, etc., then this would not work as expected.
%sql
COPY INTO table2
FROM '/mnt/repro/op/has?space/'
FILEFORMAT = csv
PATTERN='sample1.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');
Another alternative is to copy the file to DBFS using dbutils.fs.cp() and then use the DBFS path in COPY INTO.
dbutils.fs.cp('/mnt/repro/op/has space/sample1.csv','/FileStore/tables/mycsv.csv')
%sql
COPY INTO table3
FROM '/FileStore/tables/'
FILEFORMAT = csv
PATTERN='mycsv.csv'
FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
COPY_OPTIONS ('mergeSchema' = 'true');

Could not find a delimiter after string delimiter

I am trying to load data into a PolyBase table from a CSV flat file that has "/,/|/^ characters in it.
I have created a file format with (STRING_DELIMITER = '"'):
CREATE EXTERNAL FILE FORMAT StringDelimiter WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '"',
FIRST_ROW = 2,
ENCODING = 'UTF8'
) );
But I got an error while fetching from blob storage:
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Could not find a delimiter after string delimiter.
Unfortunately, escape character support is not yet available in Synapse when loading with PolyBase.
You can convert the CSV flat file to Parquet format in Data Factory.
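If you would rather not go through Data Factory, the same conversion can be sketched with pandas (file paths are placeholders and pyarrow is assumed to be installed):
import pandas as pd

# Placeholder paths; point these at your source CSV and the target Parquet file.
csv_path = "input/flatfile.csv"
parquet_path = "output/flatfile.snappy.parquet"

# pandas' CSV parser understands standard double-quote quoting, so fields
# containing characters like "/,/|/^ are read back intact.
df = pd.read_csv(csv_path)

# Snappy compression matches the external file format definition below.
df.to_parquet(parquet_path, engine="pyarrow", compression="snappy")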
Then use the following query to CREATE EXTERNAL FILE FORMAT:
CREATE EXTERNAL FILE FORMAT parquet_file_format
WITH
(
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
)
Reference link1- https://learn.microsoft.com/en-us/answers/questions/118102/polybase-load-csv-file-that-contains-text-column-w.html
Reference link2- https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#arguments-for-create-external-file-format

Extract data from Excel sheet to multiple SQL Server tables

I have an Excel file with a few columns (20) and some data that I need to upload into 4 SQL Server tables. The tables are related, and specific columns represent the ID for each table.
Is there an ETL tool that I can use to automate this process?
This query uses BULK INSERT to load the file into a #temptable and then inserts the contents of that temp table into the target table in the database. Note that the file being imported is a .csv, so just save your Excel file as CSV before doing this (a pandas sketch of that step follows the query).
CREATE TABLE #temptable (col1 VARCHAR(255), col2 VARCHAR(255), col3 VARCHAR(255)) -- placeholder column types; match them to your file
BULK INSERT #temptable FROM 'C:\yourfilelocation\yourfile.csv'
WITH
(
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0A'
)
INSERT INTO yourTableInDataBase (col1,col2,col3)
SELECT col1, col2, col3
FROM #temptable
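For the save-as-CSV step mentioned above, a minimal pandas sketch (file and sheet names are placeholders; reading .xlsx requires openpyxl):
import pandas as pd

# Placeholder workbook path and sheet name; adjust to your file.
df = pd.read_excel(r"C:\yourfilelocation\yourfile.xlsx", sheet_name="Sheet1")

# Write without the index so the columns line up with the #temptable definition.
df.to_csv(r"C:\yourfilelocation\yourfile.csv", index=False)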
To automate this, you can put the query inside a stored procedure and call the stored procedure from a batch script. Edit the code below, put it inside a text file, and save it as a .cmd file:
set MYDB=yourDBname
set MYUSER=youruser
set MYPASSWORD=yourpassword
set MYSERVER=yourservername
sqlcmd -S %MYSERVER% -d %MYDB% -U %MYUSER% -P %MYPASSWORD% -h -1 -s "," -W -Q "exec yourstoredprocedure"
