I am trying to load data into a PolyBase table from a CSV flat file that contains "/,/|/^ characters in the data.
I have created a file format with (STRING_DELIMITER = '"'):
CREATE EXTERNAL FILE FORMAT StringDelimiter WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2,
        ENCODING = 'UTF8'
    )
);
But I got an error while fetching from blob storage:
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Could not find a delimiter after string delimiter.
Unfortunately, escape character support is not yet available in Synapse when loading with PolyBase.
You can convert the CSV flat file to the Parquet file format in Data Factory.
Then use the following query to create the external file format:
CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
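For completeness, a hypothetical external table over the converted Parquet files might look like the sketch below; the table name, column list, location, and data source are placeholders, not part of the original answer:
CREATE EXTERNAL TABLE dbo.MyParquetTable (
    col1 INT,
    col2 NVARCHAR(100)
)
WITH (
    LOCATION = '/parquet-output/',       -- folder containing the converted Parquet files (placeholder)
    DATA_SOURCE = MyDataSource,          -- an existing external data source (placeholder)
    FILE_FORMAT = ParquetFileFormat      -- the file format created above
);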
Reference link1- https://learn.microsoft.com/en-us/answers/questions/118102/polybase-load-csv-file-that-contains-text-column-w.html
Reference link2- https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#arguments-for-create-external-file-format
I am trying to load a CSV file (without a header) into a Delta table using the "Load the sample data from cloud storage into the table" guideline, but I cannot find any instructions on how to define the source file schema/header.
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = <format>
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
Based on the delta-copy-into and FORMAT_OPTIONS docs I assume enforceSchema would be the right option, but how do I provide the schema definition using the SQL API?
If you don't have a header in the files, Spark will assign names automatically (_c0, _c1, etc.) and put them into the table. If you want to give meaningful names, you need slightly different syntax: use a SELECT, which gives you the ability to rename columns and do type casting if necessary. Like this (the cast is just an example):
COPY INTO my_table FROM (
SELECT _c0 as col1, cast(_c1 as date) as date, _c2 as col3, ...
FROM '/path/to/files'
)
FILEFORMAT = <format>
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
P.S. I'm not sure that inferSchema is a good idea here, as you may need to do the casts anyway.
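If you prefer not to rely on inferSchema at all, one alternative is a sketch like the following (table and column names are placeholders): create the target table with an explicit schema and do the casts in the SELECT.
CREATE TABLE IF NOT EXISTS my_table (
  col1 STRING,
  date DATE,
  col3 STRING
);

COPY INTO my_table FROM (
  SELECT _c0 AS col1, CAST(_c1 AS DATE) AS date, _c2 AS col3
  FROM '/path/to/files'
)
FILEFORMAT = CSV;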
I have created a new table from a CSV file with the following code:
%sql
SET spark.databricks.delta.schema.autoMerge.enabled = true;
create table if not exists catalog.schema.tablename;
COPY INTO catalog.schema.tablename
FROM (SELECT * FROM 's3://bucket/test.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('mergeSchema' = 'true', 'header' = 'true')
But now I have a new file with additional data. How can I load that? Please guide.
Thanks.
I need to load the new data file into the Delta table.
I tried to reproduce the same in my environment and got the result below.
Make sure the schema and the CSV file's data types match; otherwise you will get an error.
Please follow the syntax below to insert data from the CSV file:
%sql
copy into <catalog>.<schema>.<table_name>
from "<file_loaction>/file_3.csv"
FILEFORMAT = csv
FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true');
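To load the additional file, you can run COPY INTO again pointing at the new file (or at the whole folder); COPY INTO skips files it has already loaded. If the new file also brings extra columns, a sketch like the one below lets the table schema evolve (the file name test_new.csv is a placeholder):
%sql
COPY INTO catalog.schema.tablename
FROM (SELECT * FROM 's3://bucket/test_new.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');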
I am not able to understand why we are writing WITH (doc nvarchar(max)) AS rows.
JSON FILE
{"date_rep":"2020-07-24","day":24,"month":7,"year":2020,"cases":3,"deaths":0,"geo_id":"AF"}
{"date_rep":"2020-07-25","day":25,"month":7,"year":2020,"cases":7,"deaths":0,"geo_id":"AF"}
{"date_rep":"2020-07-26","day":26,"month":7,"year":2020,"cases":4,"deaths":0,"geo_id":"AF"}
{"date_rep":"2020-07-27","day":27,"month":7,"year":2020,"cases":8,"deaths":0,"geo_id":"AF"}
My code to open this JSON file starts here:
select
    JSON_VALUE(doc, '$.date_rep') AS date_reported,
    JSON_VALUE(doc, '$.countries_and_territories') AS country,
    CAST(JSON_VALUE(doc, '$.deaths') AS INT) as fatal,
    JSON_VALUE(doc, '$.cases') as cases,
    doc
from openrowset(
    bulk 'https://demoaccname.dfs.core.windows.net/demoadlscontainer/afg.json',
    format = 'csv',
    fieldterminator = '0x0b',
    fieldquote = '0x0b'
) with (doc nvarchar(max)) as rows
The WITH (jsonContent nvarchar(max)) clause defines a single column named jsonContent for the rowset returned by OPENROWSET, as in:
FROM
    OPENROWSET(
        BULK 'https://XXXXX',
        FORMAT = 'CSV',
        FIELDQUOTE = '0x0b',
        FIELDTERMINATOR = '0x0b',
        ROWTERMINATOR = '0x0b'
    ) with (jsonContent nvarchar(max)) as rows
nvarchar(max) in SQL can store any string. When we load the JSON from the source into a SQL table, we don't know the exact length of each JSON document, and a JSON document is just a string.
SQL has no dedicated JSON data type, so we use WITH (jsonContent nvarchar(MAX)) AS [result] to read each JSON document from the file into an nvarchar column. (The 0x0b delimiters, a vertical-tab character that does not appear in the data, stop the CSV reader from splitting the JSON, so each document comes back as a single column value.)
JSON documents can be stored as-is in NVARCHAR columns. This is the best way for quick data load and ingestion because the loading speed is matching loading of string columns.
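Once each document has been read into that nvarchar(max) column, the regular JSON functions can be applied to it. As a sketch (reusing the file and field names from the sample above), OPENJSON can project the document into typed columns in one pass:
select j.date_rep, j.cases, j.deaths, j.geo_id
from openrowset(
    bulk 'https://demoaccname.dfs.core.windows.net/demoadlscontainer/afg.json',
    format = 'csv',
    fieldterminator = '0x0b',
    fieldquote = '0x0b'
) with (doc nvarchar(max)) as rows
cross apply openjson(doc)
with (
    date_rep date        '$.date_rep',
    cases    int         '$.cases',
    deaths   int         '$.deaths',
    geo_id   varchar(10) '$.geo_id'
) as j;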
I'm using Synapse serverless SQL pool and get the following error when trying to use CETAS:
Msg 15860, Level 16, State 5, Line 3
External table location path is not valid. Location provided: 'https://accountName.blob.core.windows.net/containerName/test/'
My workspace managed identity should have all the correct ACLs and RBAC roles on the storage account. I'm able to query the files I have there, but I am unable to execute the CETAS command.
CREATE DATABASE SCOPED CREDENTIAL WorkspaceIdentity WITH IDENTITY = 'Managed Identity'
GO
CREATE EXTERNAL DATA SOURCE MyASDL
WITH ( LOCATION = 'https://accountName.blob.core.windows.net/containerName'
,CREDENTIAL = WorkspaceIdentity)
GO
CREATE EXTERNAL FILE FORMAT CustomCSV
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (ENCODING = 'UTF8')
);
GO
CREATE EXTERNAL TABLE Test.dbo.TestTable
WITH (
LOCATION = 'test/',
DATA_SOURCE = MyASDL,
FILE_FORMAT = CustomCSV
) AS
WITH source AS
(
SELECT
jsonContent
, JSON_VALUE (jsonContent, '$.zipCode') AS ZipCode
FROM
OPENROWSET(
BULK '/customer-001-100MB.json',
FORMAT = 'CSV',
FIELDQUOTE = '0x00',
FIELDTERMINATOR ='0x0b',
ROWTERMINATOR = '\n',
DATA_SOURCE = 'MyASDL'
)
WITH (
jsonContent varchar(1000) COLLATE Latin1_General_100_BIN2_UTF8
) AS [result]
)
SELECT ZipCode, COUNT(*) as Count
FROM source
GROUP BY ZipCode
;
I've tried everything in the LOCATION parameter of the CETAS command, but nothing seems to work: folder paths, file paths, with and without leading/trailing slashes, etc.
The CTE SELECT statement works on its own, without the CETAS.
Can't I use the same data source for both reading and writing? Or is it something else?
The issue was with my data source definition.
I had used https://; when I changed this to wasbs:// it worked, as per the following link: TSQL CREATE EXTERNAL DATA SOURCE.
It describes that you have to use wasbs, abfs, or adl depending on whether your data source is a V2 storage account, a V2 data lake, or a V1 data lake.
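For a blob storage (V2 storage account) endpoint, the corrected data source definition would look something like the sketch below, reusing the account, container, and credential names from the code above:
CREATE EXTERNAL DATA SOURCE MyASDL
WITH ( LOCATION = 'wasbs://containerName@accountName.blob.core.windows.net'
     , CREDENTIAL = WorkspaceIdentity );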
How can I make a REST API call from a notebook and store the results in Snowflake? The API call generates output files which need to be stored in Snowflake itself.
You do not have to call the API directly from Snowflake. You can load the files directly from your Python notebook over a connection to the Snowflake database.
SQL code:
-- Create destination table to store your API queries results data
create or replace table public.tmp_table
(page_id int,id int,status varchar,provider_status varchar,ts_created timestamp);
-- create a new format for your csv files
create or replace file format my_new_format type = 'csv' field_delimiter = ';' field_optionally_enclosed_by = '"' skip_header = 1;
-- Put your local file to the Snowflake's temporary storage
put file:///Users/Admin/Downloads/your_csv_file_name.csv @~/staged;
-- Copying data from storage into table
copy into public.tmp_table from @~/staged/your_csv_file_name.csv.gz file_format = my_new_format ON_ERROR=CONTINUE;
select * from public.tmp_table;
-- Delete temporary data
remove @~/staged/your_csv_file_name.csv.gz;
You can do the same with Python:
https://docs.snowflake.com/en/user-guide/python-connector-example.html#loading-data
import snowflake.connector

target_table = 'public.tmp_table'
filename = 'your_csv_file_name'
filepath = f'/home/Users/Admin/Downloads/{filename}.csv'
conn = snowflake.connector.connect(
user=USER,
password=PASSWORD,
account=ACCOUNT,
warehouse=WAREHOUSE,
database=DATABASE,
schema=SCHEMA
)
conn.cursor().execute(f'put file://{filepath} @~/staged;')
result = conn.cursor().execute(f'''
COPY INTO {target_table}
FROM @~/staged/{filename}.csv.gz
file_format = (format_name = 'my_new_format'
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
ESCAPE_UNENCLOSED_FIELD = NONE) ON_ERROR=CONTINUE;
''')
conn.cursor().execute(f'REMOVE @~/staged/{filename}.csv.gz;')
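After the copy finishes, a quick sanity check against the table created above is a simple count:
select count(*) from public.tmp_table;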