Azure Synapse T-SQL

I am new to Azure Synapse and had a question about how the files are set up on Azure when creating an external table from a select. Would the files be overwritten, or would one need to truncate the files every time a CREATE EXTERNAL TABLE AS SELECT script is run? For example, if I run the following script:
CREATE EXTERNAL TABLE [dbo].[PopulationCETAS] WITH (
LOCATION = 'populationParquet/',
DATA_SOURCE = [MyDataSource],
FILE_FORMAT = [ParquetFF]
) AS
SELECT
*
FROM
OPENROWSET(
BULK 'csv/population-unix/population.csv',
DATA_SOURCE = 'sqlondemanddemo',
FORMAT = 'CSV', PARSER_VERSION = '2.0'
) WITH (
CountryCode varchar(4),
CountryName varchar(64),
Year int,
PopulationCount int
) AS r;
Would the files created at
LOCATION = 'populationParquet/',
DATA_SOURCE = [MyDataSource],
FILE_FORMAT = [ParquetFF]
be overwritten every time the script is run? Can this be specified at the time of setup or within the query options?

I would love to be able to drop the files in storage with a DELETE or TRUNCATE operation, but this feature doesn't currently exist within T-SQL. Please vote for this feature.
In the meantime you will need to use outside automation, such as an Azure Data Factory pipeline.
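If it helps, one common workaround is to point each CETAS run at a fresh folder and drop/recreate the external table, leaving the deletion of old folders to that outside process. A minimal sketch, reusing the objects from the question (the dated sub-folder name is illustrative):
DROP EXTERNAL TABLE [dbo].[PopulationCETAS];   -- removes only the metadata; the Parquet files in storage are untouched
GO
CREATE EXTERNAL TABLE [dbo].[PopulationCETAS] WITH (
    LOCATION = 'populationParquet/run_20240101/',  -- fresh folder per run
    DATA_SOURCE = [MyDataSource],
    FILE_FORMAT = [ParquetFF]
) AS
SELECT *
FROM OPENROWSET(
    BULK 'csv/population-unix/population.csv',
    DATA_SOURCE = 'sqlondemanddemo',
    FORMAT = 'CSV', PARSER_VERSION = '2.0'
) WITH (
    CountryCode varchar(4),
    CountryName varchar(64),
    Year int,
    PopulationCount int
) AS r;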

Related

Azure Synapse Serverless CETAS error "External table location is not valid"

I'm using Synapse Serverless Pool and get the following error when trying to use CETAS:
Msg 15860, Level 16, State 5, Line 3
External table location path is not valid. Location provided: 'https://accountName.blob.core.windows.net/ontainerName/test/'
My workspace managed identity should have all the correct ACL and RBAC roles on the storage account. I'm able to query the files I have there, but I'm unable to execute the CETAS command.
CREATE DATABASE SCOPED CREDENTIAL WorkspaceIdentity WITH IDENTITY = 'Managed Identity'
GO
CREATE EXTERNAL DATA SOURCE MyASDL
WITH ( LOCATION = 'https://accountName.blob.core.windows.net/containerName'
,CREDENTIAL = WorkspaceIdentity)
GO
CREATE EXTERNAL FILE FORMAT CustomCSV
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (ENCODING = 'UTF8')
);
GO
CREATE EXTERNAL TABLE Test.dbo.TestTable
WITH (
LOCATION = 'test/',
DATA_SOURCE = MyASDL,
FILE_FORMAT = CustomCSV
) AS
WITH source AS
(
SELECT
jsonContent
, JSON_VALUE (jsonContent, '$.zipCode') AS ZipCode
FROM
OPENROWSET(
BULK '/customer-001-100MB.json',
FORMAT = 'CSV',
FIELDQUOTE = '0x00',
FIELDTERMINATOR ='0x0b',
ROWTERMINATOR = '\n',
DATA_SOURCE = 'MyASDL'
)
WITH (
jsonContent varchar(1000) COLLATE Latin1_General_100_BIN2_UTF8
) AS [result]
)
SELECT ZipCode, COUNT(*) as Count
FROM source
GROUP BY ZipCode
;
I've tried everything in the LOCATION parameter of the CETAS command, but nothing seems to work: folder paths, file paths, with and without a leading / or trailing /, etc.
The CTE SELECT statement works on its own, without the CETAS.
Can't I use the same data source for both reading and writing? Or is it something else?
The issue was with my data source definition.
I had used https://; when I changed this to wasbs:// as per the following link, TSQL CREATE EXTERNAL DATA SOURCE, it worked.
That page describes how you have to use wasbs, abfs, or adl depending on whether your data source is a V2 storage account, a V2 data lake, or a V1 data lake.
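For reference, a corrected data source along those lines would look roughly like this (accountName and containerName are the placeholders from the question):
CREATE EXTERNAL DATA SOURCE MyASDL
WITH ( LOCATION = 'wasbs://containerName@accountName.blob.core.windows.net'
      ,CREDENTIAL = WorkspaceIdentity)
GO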

How do you setup a Synapse Serverless SQL External Table over partitioned data?

I have setup a Synapse workspace and imported the Covid19 sample data into a PySpark notebook.
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
blob_sas_token)
df = spark.read.parquet(wasbs_path)
I have then partitioned the data by country_region, and written it back down into my storage account.
df.write.partitionBy("country_region") \
    .mode("overwrite") \
    .parquet("abfss://rawdata@synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/")
All of that works fine. So far I have only found a way to query data from one exact partition using OPENROWSET, like this...
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=Afghanistan/**',
FORMAT = 'PARQUET'
) AS [result]
I want to set up a Serverless SQL external table over the partitioned data, so that when people run a query with "WHERE country_region = x" it only reads the appropriate partition. Is this possible, and if so, how?
You need to get the partition value using the filepath function, like this, and then filter on it. That achieves partition elimination. You can confirm it by comparing the bytes read with and without the filter on that column.
CREATE VIEW MyView
AS
SELECT
*, filepath(1) as country_region
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=*/*',
FORMAT = 'PARQUET'
) AS [result]
GO
SELECT * FROM MyView WHERE country_region = 'Afghanistan'
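If the data were partitioned by more than one column, the same pattern extends with additional wildcards; a hypothetical sketch with an extra year= folder level (not part of the original layout):
SELECT
    *, filepath(1) AS country_region, filepath(2) AS [year]
FROM
    OPENROWSET(
        BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=*/year=*/*',
        FORMAT = 'PARQUET'
    ) AS [result]
WHERE filepath(1) = 'Afghanistan' AND filepath(2) = '2020'
Each filepath(n) returns the value matched by the n-th wildcard in the BULK path, which is what makes the partition elimination work.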

Azure SQL Data Warehouse Polybase Query to Azure Data Lake Gen 2 returns zero rows

Why does an Azure SQL Data Warehouse Polybase Query to Azure Data Lake Gen 2 return many rows for a single file source, but zero rows for the parent folder source?
I created:
Master Key (CREATE MASTER KEY;)
Credential (CREATE DATABASE SCOPED CREDENTIAL) - uses the ADLS Gen 2 account key
External data source (CREATE EXTERNAL DATA SOURCE)
File format (CREATE EXTERNAL FILE FORMAT)
External table (CREATE EXTERNAL TABLE)
Everything works fine when my external table points to a specific file, i.e.
CREATE EXTERNAL TABLE [ext].[Time]
(
[TimeID] int NOT NULL,
[HourNumber] tinyint NOT NULL,
[MinuteNumber] tinyint NOT NULL,
[SecondNumber] tinyint NOT NULL,
[TimeInSecond] int NOT NULL,
[HourlyBucket] varchar(15) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
)
WITH
(
LOCATION = '/Time/time001.txt',
DATA_SOURCE = ADLSDataSource,
FILE_FORMAT = uncompressedcsv,
REJECT_TYPE = value,
REJECT_VALUE = 2147483647
);
SELECT * FROM [ext].[Time];
Many rows returned, therefore I am confident all items mentioned above are configured correctly.
The Time folder in Azure Data Lake Gen 2 contains many files, not just time001.txt. When I change my external table to point at a folder, and not an individual file, the query returns zero rows, i.e.
CREATE EXTERNAL TABLE [ext].[Time]
(
[TimeID] int NOT NULL,
[HourNumber] tinyint NOT NULL,
[MinuteNumber] tinyint NOT NULL,
[SecondNumber] tinyint NOT NULL,
[TimeInSecond] int NOT NULL,
[HourlyBucket] varchar(15) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
)
WITH
(
LOCATION = '/Time/',
DATA_SOURCE = ADLSDataSource,
FILE_FORMAT = uncompressedcsv,
REJECT_TYPE = value,
REJECT_VALUE = 2147483647
);
SELECT * FROM [ext].[Time];
Zero rows returned
I tried:
LOCATION = '/Time/',
LOCATION = '/Time',
LOCATION = 'Time/',
LOCATION = 'Time',
But always zero rows. I also followed the instructions at https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-load-from-azure-data-lake-store
I tested all the files within the folder individually and each returns many rows of data.
When I queried the same files from Blob storage rather than ADLS Gen2, the "folder" query returned all rows as expected.
How do I query all files in a folder "as one" from Azure Data Lake Gen2 storage using Azure SQL Data Warehouse and PolyBase?
I was facing exactly the same issue: the problem was with the data source protocol.
Script with the error:
CREATE EXTERNAL DATA SOURCE datasourcename
WITH (
TYPE = HADOOP,
LOCATION = 'abfss://container@storage.blob.core.windows.net',
CREDENTIAL = credential_name
);
Script that solves the issue:
CREATE EXTERNAL DATA SOURCE datasourcename
WITH (
TYPE = HADOOP,
LOCATION = 'abfss://container@storage.dfs.core.windows.net',
CREDENTIAL = credential_name
);
The only change needed was the LOCATION endpoint: blob.core.windows.net became dfs.core.windows.net.
Thanks to the Microsoft Support Team for helping me on this.

Create table in Athena using all objects from multiple folders in S3 Bucket via Boto3

My S3 Bucket has multiple sub-directories that store data for multiple websites based on the day.
example:
bucket/2020-01-03/website1, and within this is where the CSVs are stored.
I am able to create tables based on each of the objects, but I want to create one consolidated table for all sub-directories/objects/data stored under the prefix bucket/2020-01-03 for all websites, as well as all other dates.
I used the code below to create one table for a single website:
Athena configuration
athena = boto3.client('athena',aws_access_key_id=ACCESS_KEY,aws_secret_access_key=SECRET_KEY,
region_name= 'us-west-2')
s3_input = 's3://bucket/2020-01-03/website1'
database = 'database1'
table = 'consolidated_table'
Athena database and table definition
create_table = \
"""CREATE EXTERNAL TABLE IF NOT EXISTS `%s.%s` (
`website_id` string COMMENT 'from deserializer',
`user` string COMMENT 'from deserializer',
`action` string COMMENT 'from deserializer',
`date` string COMMENT 'from deserializer'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\"', 'separatorChar'=','
) LOCATION '%s'
TBLPROPERTIES (
'skip.header.line.count'='1',
'transient_lastDdlTime'='1576774420');""" % ( database, table, s3_input )
athena.start_query_execution(QueryString=create_table,
WorkGroup = 'user_group',
QueryExecutionContext={'Database': 'database1'},
ResultConfiguration={'OutputLocation': 's3://aws-athena-query-results-5000-us-west-2'})
I also want to overwrite this table with new data from S3 every time I run it.
You can have a consolidated table for the files from different "directories" on S3 only if all of them adhere to the same data schema. As I can see from your CREATE EXTERNAL TABLE statement, each file contains four columns: website_id, user, action, and date. So you can simply change LOCATION to point to the root of your S3 "directory structure":
CREATE EXTERNAL TABLE IF NOT EXISTS `database1`.`consolidated_table` (
`website_id` string COMMENT 'from deserializer',
`user` string COMMENT 'from deserializer',
`action` string COMMENT 'from deserializer',
`date` string COMMENT 'from deserializer'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\"', 'separatorChar'=','
)
LOCATION 's3://bucket' -- instead of restricting it to s3://bucket/2020-01-03/website1
TBLPROPERTIES (
'skip.header.line.count'='1'
);
In this case, each Athena query would scan all files under the s3://bucket location, and you can use website_id and date in the WHERE clause to filter results. However, if you have a lot of data you should consider partitioning; it will save you not only query execution time but also money (see this post).
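If you do go the partitioning route, a sketch of what that could look like here follows; the partition column names dt and site are illustrative, and because the folder layout bucket/2020-01-03/website1 does not follow the Hive key=value convention, each folder has to be registered explicitly with ALTER TABLE ... ADD PARTITION rather than discovered with MSCK REPAIR TABLE.
CREATE EXTERNAL TABLE IF NOT EXISTS `database1`.`consolidated_partitioned` (
  `website_id` string,
  `user` string,
  `action` string,
  `date` string
)
PARTITIONED BY (`dt` string, `site` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('escapeChar'='\"', 'separatorChar'=',')
LOCATION 's3://bucket/'
TBLPROPERTIES ('skip.header.line.count'='1');

-- register one partition per date/website folder
ALTER TABLE `database1`.`consolidated_partitioned`
ADD IF NOT EXISTS PARTITION (dt='2020-01-03', site='website1')
LOCATION 's3://bucket/2020-01-03/website1/';
Queries that filter on dt and site will then scan only the matching folders.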
I also want to overwrite this table with new data from S3 every time I run it.
I assume you mean that every time you run an Athena query, it should scan the files on S3 even if they were added after you executed CREATE EXTERNAL TABLE. Note that CREATE EXTERNAL TABLE simply defines metadata about your data, i.e. where it is located on S3, its columns, etc. Thus, a query against a table with LOCATION 's3://bucket' (without partitioning) will always include all your S3 files.

Cannot create EXTERNAL TABLE on Azure SQL Database

Can you not create an external table on an Azure SQL Database with a file format? I'm trying to create an external table over a table I dumped into blob storage.
From this page: https://msdn.microsoft.com/en-us/library/dn935021.aspx
-- Create a new external table
CREATE EXTERNAL TABLE [ database_name . [ schema_name ] . | schema_name. ] table_name
( <column_definition> [ ,...n ] )
WITH (
LOCATION = 'folder_or_filepath',
DATA_SOURCE = external_data_source_name,
FILE_FORMAT = external_file_format_name
[ , <reject_options> [ ,...n ] ]
)
[;]
Is the documentation incorrect or am I missing something? I can't seem to create an EXTERNAL FILE FORMAT and keep receiving an
"Incorrect syntax near 'EXTERNAL'." error.
CREATE EXTERNAL FILE FORMAT [DelimitedText]
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = N'~¶~',
USE_TYPE_DEFAULT = False
),
DATA_COMPRESSION = N'org.apache.hadoop.io.compress.GzipCodec')
GO
The problem is (probably) that you are trying to use PolyBase on an Azure SQL Database, but PolyBase is only supported on SQL Server 2016 on-premises. It is, however, supported on Azure SQL Data Warehouse: PolyBase Versioned Feature Summary.
If you create an Azure SQL Data Warehouse instead of an Azure SQL Database, you should have the PolyBase features available, including creating an external file format.
Running this:
CREATE EXTERNAL FILE FORMAT TextFormat
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = N'~¶~',
USE_TYPE_DEFAULT = False
),
DATA_COMPRESSION = N'org.apache.hadoop.io.compress.GzipCodec')
GO
On an Azure SQL Database will give you an error like:
Msg 102, Level 15, State 1, Line 1
Incorrect syntax near 'EXTERNAL'.
Running the same thing on Azure SQL Data Warehouse will work:
Command(s) completed successfully.
You will not be able to work with Hadoop data sources using Azure SQL Data Warehouse, but working with Azure Blob Storage is supported.
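For example, pointing PolyBase on Azure SQL Data Warehouse at Blob Storage typically involves a data source like the sketch below; the names are illustrative and the database scoped credential is assumed to already exist:
CREATE EXTERNAL DATA SOURCE AzureBlobStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential  -- database scoped credential holding the storage account key
);
The external file format and external table from the earlier examples can then reference this data source.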
