I'm using Synapse Serverless Pool and get the following error trying to use CETAS
Msg 15860, Level 16, State 5, Line 3
External table location path is not valid. Location provided: 'https://accountName.blob.core.windows.net/containerName/test/'
My workspace managed identity should have all the correct ACL and RBAC roles on the storage account. I'm able to query the files I have there, but I am unable to execute the CETAS command.
CREATE DATABASE SCOPED CREDENTIAL WorkspaceIdentity WITH IDENTITY = 'Managed Identity'
GO
CREATE EXTERNAL DATA SOURCE MyASDL
WITH ( LOCATION = 'https://accountName.blob.core.windows.net/containerName'
,CREDENTIAL = WorkspaceIdentity)
GO
CREATE EXTERNAL FILE FORMAT CustomCSV
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (ENCODING = 'UTF8')
);
GO
CREATE EXTERNAL TABLE Test.dbo.TestTable
WITH (
LOCATION = 'test/',
DATA_SOURCE = MyASDL,
FILE_FORMAT = CustomCSV
) AS
WITH source AS
(
SELECT
jsonContent
, JSON_VALUE (jsonContent, '$.zipCode') AS ZipCode
FROM
OPENROWSET(
BULK '/customer-001-100MB.json',
FORMAT = 'CSV',
FIELDQUOTE = '0x00',
FIELDTERMINATOR ='0x0b',
ROWTERMINATOR = '\n',
DATA_SOURCE = 'MyASDL'
)
WITH (
jsonContent varchar(1000) COLLATE Latin1_General_100_BIN2_UTF8
) AS [result]
)
SELECT ZipCode, COUNT(*) as Count
FROM source
GROUP BY ZipCode
;
I've tried everything in the LOCATION parameter of the CETAS command, but nothing seems to work: folder paths, file paths, with and without a leading or trailing /, etc.
The CTE select statement works without the CETAS.
Can't I use the same data source for both reading and writing? Or is it something else?
The issue was with my data source definition.
I had used https://; when I changed this to wasbs:// as per the following link, it worked: TSQL CREATE EXTERNAL DATA SOURCE.
That page describes that you have to use wasbs, abfs, or adl depending on whether your data source is a V2 storage account, a Gen2 data lake, or a Gen1 data lake.
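For reference, a minimal sketch of the corrected data source definition, assuming the same account, container, and credential names as in the script above; the only change from the failing version is the wasbs:// prefix in the LOCATION:
-- Drop and recreate the data source if it was originally created with an https:// LOCATION.
CREATE EXTERNAL DATA SOURCE MyASDL
WITH (
    LOCATION = 'wasbs://containerName@accountName.blob.core.windows.net',
    CREDENTIAL = WorkspaceIdentity
);
GO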
I have setup a Synapse workspace and imported the Covid19 sample data into a PySpark notebook.
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
    'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
    blob_sas_token)
df = spark.read.parquet(wasbs_path)
I have then partitioned the data by country_region, and written it back down into my storage account.
df.write.partitionBy("country_region") \
    .mode("overwrite") \
    .parquet("abfss://rawdata@synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/")
All that works fine as you can see. So far I have only found a way to query data from the exact partition using OPENROWSET, like this...
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=Afghanistan/**',
FORMAT = 'PARQUET'
) AS [result]
I want to set up a serverless SQL external table over the partitioned data, so that when people run a query with "WHERE country_region = x" it will only read the appropriate partition. Is this possible, and if so, how?
You need to get the partition value using the filepath function, like this, and then filter on it. That achieves partition elimination. You can confirm it by comparing the bytes read against a query that doesn't filter on that column.
CREATE VIEW MyView
AS
SELECT
    *, filepath(1) AS country_region
FROM
    OPENROWSET(
        BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=*/*',
        FORMAT = 'PARQUET'
    ) AS [result]
GO
SELECT * FROM MyView WHERE country_region = 'Afghanistan'
I am new to Azure Synapse and had a question about how the files are set up on Azure when creating an external table from a select. Would the files be overwritten, or would one need to truncate the files every time a CREATE EXTERNAL TABLE script is run? For example, if I run the following script:
CREATE EXTERNAL TABLE [dbo].[PopulationCETAS] WITH (
LOCATION = 'populationParquet/',
DATA_SOURCE = [MyDataSource],
FILE_FORMAT = [ParquetFF]
) AS
SELECT
*
FROM
OPENROWSET(
BULK 'csv/population-unix/population.csv',
DATA_SOURCE = 'sqlondemanddemo',
FORMAT = 'CSV', PARSER_VERSION = '2.0'
) WITH (
CountryCode varchar(4),
CountryName varchar(64),
Year int,
PopulationCount int
) AS r;
Would the files created at
LOCATION = 'populationParquet/',
DATA_SOURCE = [MyDataSource],
FILE_FORMAT = [ParquetFF]
be overwritten every time the script is run? Can this be specified at the time of setup or within the query options?
The files are not overwritten; the CETAS will fail if the target folder already contains files. I would love to be able to drop the files in storage with a DELETE or TRUNCATE operation, but this feature doesn't currently exist within T-SQL. Please vote for this feature.
In the meantime you will need to use outside automation, such as an Azure Data Factory pipeline, to clear the folder before re-running the CETAS.
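For reference, a minimal sketch of the re-run pattern, reusing the objects from the question above; the actual file deletion has to happen outside T-SQL (for example in a pipeline step) before the CETAS is executed again:
-- 1. Drop the external table metadata (this does NOT delete the underlying files).
DROP EXTERNAL TABLE [dbo].[PopulationCETAS];
-- 2. Delete the contents of 'populationParquet/' outside T-SQL,
--    e.g. with an Azure Data Factory / Synapse pipeline Delete activity.
-- 3. Re-run the CETAS; it will fail if the target folder still contains files.
CREATE EXTERNAL TABLE [dbo].[PopulationCETAS] WITH (
    LOCATION = 'populationParquet/',
    DATA_SOURCE = [MyDataSource],
    FILE_FORMAT = [ParquetFF]
) AS
SELECT *
FROM OPENROWSET(
    BULK 'csv/population-unix/population.csv',
    DATA_SOURCE = 'sqlondemanddemo',
    FORMAT = 'CSV', PARSER_VERSION = '2.0'
) WITH (
    CountryCode varchar(4),
    CountryName varchar(64),
    Year int,
    PopulationCount int
) AS r;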
Why does an Azure SQL Data Warehouse Polybase Query to Azure Data Lake Gen 2 return many rows for a single file source, but zero rows for the parent folder source?
I created:
Master Key (CREATE MASTER KEY;)
Credential (CREATE DATABASE SCOPED CREDENTIAL) - uses the ADLS Gen 2 account key
External data source (CREATE EXTERNAL DATA SOURCE)
File format (CREATE EXTERNAL FILE FORMAT)
External table (CREATE EXTERNAL TABLE)
Everything works fine when my external table points to a specific file, i.e.
CREATE EXTERNAL TABLE [ext].[Time]
(
[TimeID] int NOT NULL,
[HourNumber] tinyint NOT NULL,
[MinuteNumber] tinyint NOT NULL,
[SecondNumber] tinyint NOT NULL,
[TimeInSecond] int NOT NULL,
[HourlyBucket] varchar(15) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
)
WITH
(
LOCATION = '/Time/time001.txt',
DATA_SOURCE = ADLSDataSource,
FILE_FORMAT = uncompressedcsv,
REJECT_TYPE = value,
REJECT_VALUE = 2147483647
);
SELECT * FROM [ext].[Time];
Many rows returned, therefore I am confident all items mentioned above are configured correctly.
The Time folder in Azure Data Lake Gen 2 contains many files, not just time001.txt. When I change my external table to point at a folder, and not an individual file, the query returns zero rows, i.e.
CREATE EXTERNAL TABLE [ext].[Time]
(
[TimeID] int NOT NULL,
[HourNumber] tinyint NOT NULL,
[MinuteNumber] tinyint NOT NULL,
[SecondNumber] tinyint NOT NULL,
[TimeInSecond] int NOT NULL,
[HourlyBucket] varchar(15) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
)
WITH
(
LOCATION = '/Time/',
DATA_SOURCE = ADLSDataSource,
FILE_FORMAT = uncompressedcsv,
REJECT_TYPE = value,
REJECT_VALUE = 2147483647
);
SELECT * FROM [ext].[Time];
Zero rows returned
I tried:
LOCATION = '/Time/',
LOCATION = '/Time',
LOCATION = 'Time/',
LOCATION = 'Time',
But always zero rows. I also followed the instructions at https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-load-from-azure-data-lake-store
I tested all the files within the folder and individually each returns many rows of data.
I queried all the files from Blob storage and not ADLS Gen2 and the "Folder" query returns all rows as expected.
How do I query all files in a folder "as one" from Azure Data Lake Gen2 storage using Azure SQL Data Warehouse and Polybase?
I was facing exactly the same issue: the problem was in the data source LOCATION.
Script with error:
CREATE EXTERNAL DATA SOURCE datasourcename
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://container@storage.blob.core.windows.net',
    CREDENTIAL = credential_name
);
Script that solves the issue:
CREATE EXTERNAL DATA SOURCE datasourcename
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://container@storage.dfs.core.windows.net',
    CREDENTIAL = credential_name
);
The only change needed was the LOCATION: with abfss the endpoint must be dfs.core.windows.net rather than blob.core.windows.net.
Thanks to the Microsoft Support Team for helping me on this.
I am trying to fetch some data from Azure Data Lake into Azure SQL Data Warehouse, but I am unable to do it. I have followed the documentation link
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-load-from-azure-data-lake-store
but I am getting this error when I try to create an external table. I have created another web/API app, but still was not able to access the application. Here is the error which I am facing:
EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
GETFILESTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.). [0ec4b8e0-b16d-470e-9c98-37818176a188][2017-08-14T02:30:58.9795172-07:00]: Error [GETFILESTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.). [0ec4b8e0-b16d-470e-9c98-37818176a188][2017-08-14T02:30:58.9795172-07:00]] occurred while accessing external file.'
Here is the script which I am trying to get to work:
CREATE DATABASE SCOPED CREDENTIAL ADLCredential2
WITH
IDENTITY = '2ec11315-5a30-4bea-9428-e511bf3fa8a1#https://login.microsoftonline.com/24708086-c2ce-4b77-8d61-7e6fe8303971/oauth2/token',
SECRET = '3Htr2au0b0wvmb3bwzv1FekK88YQYZCUrJy7OB3NzYs='
;
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore11
WITH (
TYPE = HADOOP,
LOCATION = 'adl://test.azuredatalakestore.net/',
CREDENTIAL = ADLCredential2
);
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH
( FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS ( FIELD_TERMINATOR = '|'
, DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff'
, USE_TYPE_DEFAULT = FALSE
)
);
CREATE EXTERNAL TABLE [extccsm].[external_medication]
(
person_id varchar(4000),
encounter_id varchar(4000),
fin varchar(4000),
mrn varchar(4000),
icd_code varchar(4000),
icd_description varchar(300),
priority integer,
optional1 varchar(4000),
optional2 varchar(4000),
optional3 varchar(4000),
load_identifier varchar(4000),
upload_time datetime2,
xx_person_id varchar(4000),--Person ID is the ID that we will use to represent the person uniquely throughout the process. This requires initial analysis to determine how to set it
xx_encounter_id varchar(4000),--Encounter ID is the ID that will represent the encounter uniquely throughout the process. This requires initial analysis to determine how to set it based on client data
mod_optional1 varchar(4000),
mod_optional2 varchar(4000),
mod_optional3 varchar(4000),
mod_optional4 varchar(4000),
mod_optional5 varchar(4000),
mod_loadidentifier datetime2
)
WITH
(
LOCATION='\testfiles\procedure_azure.txt000\',
DATA_SOURCE = AzureDataLakeStore11, --DATA SOURCE THE BLOB STORAGE
FILE_FORMAT = TextFileFormat, --TYPE OF FILE FORMAT
REJECT_TYPE = percentage,
REJECT_VALUE = 1,
REJECT_SAMPLE_VALUE = 0
);
Please tell me what's wrong here.
I can reproduce this, but it's hard to narrow down exactly. I think it's to do with permissions. From the Azure portal:
Data Lake Store > yourDataLakeAccount > your folder > Access
From there, make sure your AD application has Read, Write and Execute permissions on the relevant files / folders. Start with one file initially. I can reproduce the error by assigning / unassigning the Execute permission, but need to repeat the steps to confirm. I'll retrace my steps, but for now concentrate your search here. In my example below, my Azure Active Directory application is called adwAndPolybase; you can see I've given it Read, Write and Execute. I also experimented with the Advanced and 'Apply to children' options:
Can you not create an External table on an Azure SQL Database with a format file? I'm trying to create an external table to a table I dumped into blob storage.
From this page: https://msdn.microsoft.com/en-us/library/dn935021.aspx
-- Create a new external table
CREATE EXTERNAL TABLE [ database_name . [ schema_name ] . | schema_name. ] table_name
( <column_definition> [ ,...n ] )
WITH (
LOCATION = 'folder_or_filepath',
DATA_SOURCE = external_data_source_name,
FILE_FORMAT = external_file_format_name
[ , <reject_options> [ ,...n ] ]
)
[;]
Is the documentation incorrect, or am I missing something? I can't seem to create an external file format and keep receiving an
"Incorrect syntax near 'EXTERNAL'." error.
CREATE EXTERNAL FILE FORMAT [DelimitedText]
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = N'~¶~',
USE_TYPE_DEFAULT = False
),
DATA_COMPRESSION = N'org.apache.hadoop.io.compress.GzipCodec')
GO
The problem is (probably) that you are trying to use PolyBase on an Azure SQL Database, but PolyBase is only supported on on-premises SQL Server 2016. It is, however, supported on Azure SQL Data Warehouse: PolyBase Versioned Feature Summary
If, instead of an Azure SQL Database, you create an Azure SQL Data Warehouse, you should have the PolyBase features available, including creating an external file format.
Running this:
CREATE EXTERNAL FILE FORMAT TextFormat
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = N'~¶~',
USE_TYPE_DEFAULT = False
),
DATA_COMPRESSION = N'org.apache.hadoop.io.compress.GzipCodec')
GO
On an Azure SQL Database will give you an error like:
Msg 102, Level 15, State 1, Line 1
Incorrect syntax near 'EXTERNAL'.
Running the same thing on Azure SQL Data Warehouse will work:
Command(s) completed successfully.
You will not be able to work with Hadoop data sources using Azure SQL Data Warehouse, but working with Azure Blob Storage is supported.
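For completeness, a minimal sketch of an external data source over Azure Blob Storage in Azure SQL Data Warehouse, assuming a hypothetical storage account, container, and account key; the wasbs:// prefix points PolyBase at Blob Storage rather than a Hadoop cluster:
-- Hypothetical names; a database master key must already exist.
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';
GO
CREATE EXTERNAL DATA SOURCE MyBlobStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://containerName@accountName.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);
GO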