Is there anyway to create External table in Azure SQL DWH even though the location path mentioned in external table statement doesn't exist.
For eg:- location '/src/temp' doesn't exist still I want external table to be created.
create external table ext.dummy(
PERSON_ID varchar(500) ,
ASSIGNMENT_ID varchar(500)
)
WITH
(
LOCATION='/src/temp',
DATA_SOURCE = YasCdpBlobStorage,
FILE_FORMAT = ExtTableTextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
No, this is not possible. The external location whether it is a folder or a filepath must exist before you create the external table.
Although the documentation does not state this explicitlly, it is implied by the term "actual", ie
LOCATION = 'folder_or_filepath'
Specifies the folder or the file path
and file name for the actual data in Hadoop or Azure blob storage.
Related
Using Azure Synapse , Dedicated SQL pool.
How can I structure my tables underneath a button that represents the schema?
This is a small issue that really makes a big impact when many tables and schemas will be used in the database, and users will need to navigate to the correct schema quickly.
I tried dragging the schema under over the tables section, but nothing worked.
to structure the table first we need to create schema in sql dedicated pool using below command
CREATE SCHEMA <schemaName>
we need to create table using above created schema with required columns with suitable data types to that column.
Table creation:
CREATE TABLE <tableName>(col1 dataType,col2 dataType)
I created external table in dedicated sql pool in synapse following below steps:
Schema creation:
Created external data source:
CREATE EXTERNAL DATA SOURCE [DATASOURCE] WITH
(
LOCATION = '<location>',
)
Image for reference:
created external file format:
CREATE EXTERNAL FILE FORMAT [FileFormat1] WITH
(
FORMAT_TYPE = DELIMITEDTEXT
)
Image for reference:
I created external table using above data source and file format using below code:
CREATE EXTERNAL TABLE [wwi].[information2]
(
[Id] INT
)
WITH
(
LOCATION = '<folder/file>',
DATA_SOURCE = [DATASOURCE1],
FILE_FORMAT = [FileFormat1]
)
Image for reference:
In this we can structure the table in synapse dedicated pool.
I am using Azure Synapse to query a large number of CSV files with the OPENROWSET command see here. The files are located on a Data Lake gen 2 connected to the Azure Synapse via a managed identity.
This is working fine when I am only querying a few files at a time, however when I increase the number of files which I am trying to query simultaneousally I am getting the following error:
Azure Synapse: Cannot bulk load because the file <file> could not be opened. Operating system error code 12(The access code is invalid.)
Here <file> is a different file each time I run the query. If I navigate to the file in the linked data view I can download and view the file. Also if I specify to run the query on the file mentioned previousally in an error it will work fine.
The code I am using to query the data lake is below:
SELECT
Parsed.*
FROM OPENROWSET
(
bulk '2021/*/**.log',
maxerrors = 2147483647,
data_source = 'analytics',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b'
) WITH (doc nvarchar(max)) AS Rows
CROSS APPLY OPENJSON(Rows.doc)
WITH
(
col1 NVARCHAR(100),
col2 NVARCHAR(100),
...,
coln NVARCHAR(MAX)
) AS Parsed
Here the data source, analytics is a data source specified as follows:
CREATE EXTERNAL DATA SOURCE analytics
WITH
(
location = 'https://<url>.dfs.core.windows.net/analytics'
)
I have tried specifying a large number for the MAXERRORS parameter for BULK in OPENROWSET as I don't mind if only a few files are missed in executing this query, however this only seems to work at the row level for errors and these errors are at the file level.
The query is running here on the built-in serverless pool.
Any ideas on how to get around this issue would be appreciated.
Your code is doing pass through authentication to storage for whatever AAD user is connected to Synapse serverless (which will fail if you are using a SQL login). To use the MSI to connect to storage you will need a database scoped credential and will need to reference it in the external data source as in this example.
-- Optional: Create MASTER KEY if not exists in database:
-- CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Very Strong Password>
CREATE DATABASE SCOPED CREDENTIAL SynapseIdentity
WITH IDENTITY = 'Managed Identity';
GO
CREATE EXTERNAL DATA SOURCE mysample
WITH ( LOCATION = 'https://<storage_account>.dfs.core.windows.net/<container>/<path>',
CREDENTIAL = SynapseIdentity
)
Also see the sections in that article about the firewall on your storage account if you have that locked down.
Just adding a quick answer to this. After making the changes which Greg suggested, I was able to query a little more data - but still getting hit with the error code 12.
I spoke with Azure support and they advised me that the error message was actually 412 (but it was not visable to me); so this implied file in use/being modified. Adding the following allowed Azure Synapse to ignore this and query the file regardless:
ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
or for external tables:
TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
This made my final query:
SELECT
Parsed.*
FROM OPENROWSET
(
bulk '2021/*/**.log',
maxerrors = 2147483647,
data_source = 'analytics_master_key',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b',
ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
) WITH (doc nvarchar(max)) AS Rows
CROSS APPLY OPENJSON(Rows.doc)
WITH
(
col1 NVARCHAR(100),
col2 NVARCHAR(100),
...,
coln NVARCHAR(MAX)
) AS Parsed
I am newbie to Azure Data Factory. I'm trying to load multiple files of various states from FTP location into a single Azure SQL Server table. My requirement is to get state name from of the file and dump it into table along with actual data.
Currently, my source is FTP. Sink is Azure SQL Server table. I have used Stored Procedure to load the data. However, I'm unable to send file name as a parameter as shown below to the stored procedure so that I can dump it into the table. Below is the Copy Data component -
I have defined SourceFileName parameter in stored procedure, however, I am unable to send it via COPY Data activity.
Any help is appreciated.
We can conclude that additional column option can not be used here. Because ADF will return a column(contain filepath) not a string. So we need to use GetMetaData activity to get the file list, then foreach the file list and inside a Foreach activity to copy them.
I've created a simple test, it works well.
In my local FTP server, there is two text files. I need to copy them into an Azure SQL table.
At GetMetaData activity, I use Child Items to get the filelist.
At ForEach activity, I use #activity('Get Metadata1').output.childItems to foreach the file list.
Inside the ForEach activity, I use dynamic content #item().name to get the file path.
source setting:
sink setting:
So we can get the filename. Follows are some operations I did on Azure SQL.
-- create a table
CREATE TABLE [dbo].[employee](
[firstName] [varchar](50) NULL,
[lastName] [varchar](50) NULL,
[filePath] [varchar](50) NULL
) ON [PRIMARY]
GO
-- create a table type
CREATE TYPE [dbo].[ct_employees_type] AS TABLE(
[firstName] [varchar](50) NULL,
[lastName] [varchar](50) NULL
)
GO
-- create a Stored procedure
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[spUpsertEmployees]
#employees ct_employees_type READONLY,
#filePath varchar(50)
AS
BEGIN
set #filePath = SUBSTRING( #filePath,1,len(#filePath)-4)
MERGE [dbo].[employee] AS target_sqldb
USING #employees AS source_tblstg
ON (target_sqldb.firstName = source_tblstg.firstName)
WHEN MATCHED THEN
UPDATE SET
firstName = source_tblstg.firstName,
lastName = source_tblstg.lastName
WHEN NOT MATCHED THEN
INSERT (
firstName,
lastName,
filePath
)
VALUES (
source_tblstg.firstName,
source_tblstg.lastName,
#filePath
);
END
GO
After I run debug, the result is follows:
We have CSV files in Azure Data lake Gen 2 under partitioned folder, so there will be multiple CSV files for a single large table. We want to consume these files in Azure Data Explorer by creating an external table. so I am using below script to create an external table in ADX:
.create external table TestAdx
(
id: int,
name: string,
designation: string
)
kind=adl
dataformat=csv
(
h#'abfss://containername#storageaccountname.dfs.core.windows.net/staging/textadx;token=<<generating using .net API>>'
)
with
(
docstring = "Docs",
folder = "ExternalTables",
namePrefix="Prefix"
)
I am able to execute this query and the external table is created but when I try to fetch data from this table it is giving below error:
Semantic error: 'TestAdx' has the following semantic error: ''
operator: Failed to resolve table or column or scalar expression named
'TestAdx'.
Also please let me know is this the correct approach to work with ADLS Gen2 file form ADX?
what is the query you're running? are you using the external_table() function?
You need to use external_table("TestAdx") to access the external table.
Following is the example for creating external table with Azure Data Explorer with Azure Data Lake Gen 2. I have added the partition key and other parameters.
.create external table BugsCSV
(
Column1 : string,
Column2 : string,
Column3 : string
)
kind=adl
partition by "State="State
dataformat=csv
(
h#'abfss://containername#storageaccountname.dfs.core.windows.net/path;key'
)
with
(
docstring = "Docs",
folder = "ExternalTables",
compressed=true,
compressiontype="lz4"
)
I am trying to execute this query but as userdefined(Create type) types are not supportable in azure data warehouse. and i want to use it in stored procedure.
CREATE TYPE DataTypeforCustomerTable AS TABLE(
PersonID int,
Name varchar(255),
LastModifytime datetime
);
GO
CREATE PROCEDURE usp_upsert_customer_table #customer_table DataTypeforCustomerTable READONLY
AS
BEGIN
MERGE customer_table AS target
USING #customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
UPDATE SET Name = source.Name,LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
INSERT (PersonID, Name, LastModifytime)
VALUES (source.PersonID, source.Name, source.LastModifytime);
END
GO
CREATE TYPE DataTypeforProjectTable AS TABLE(
Project varchar(255),
Creationtime datetime
);
GO
CREATE PROCEDURE usp_upsert_project_table #project_table DataTypeforProjectTable READONLY
AS
BEGIN
MERGE project_table AS target
USING #project_table AS source
ON (target.Project = source.Project)
WHEN MATCHED THEN
UPDATE SET Creationtime = source.Creationtime
WHEN NOT MATCHED THEN
INSERT (Project, Creationtime)
VALUES (source.Project, source.Creationtime);
END
Is there any alternative way to do this.
You've got a few challenges there, because most of what you're trying to convert is not the way to do things on ASDW.
First, as you point out, CREATE TYPE is not supported, and there is no equivalent alternative.
Next, the code appears to be doing single inserts to a table. That's really bad on ASDW, performance will be dreadful.
Next, there's no MERGE statement (yet) for ASDW. That's because UPDATE is not the best way to handle changing data.
And last, stored procedures work a little differently on ASDW, they're not compiled, but interpreted each time the procedure is called. Stored procedures are great for big chunks of table-level logic, but not recommended for high volume calls with single-row operations.
I'd need to know more about the use case to make specific recommendations, but in general you need to think in tables rather than rows. In particular, focus on the CREATE TABLE AS (CTAS) way of handling your ELT.
Here's a good link, it shows how the equivalent of a Merge/Upsert can be handled using a CTAS:
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-develop-ctas#replace-merge-statements
As you'll see, it processes two tables at a time, rather than one row. This means you'll need to review the logic that called your stored procedure example.
If you get your head around doing everything in CTAS, and separately around Distribution, you're well on your way to having a high performance data warehouse.
Temp tables in Azure SQL Data Warehouse have a slightly different behaviour to box product SQL Server or Azure SQL Database - they exist at the session level. So all you have to do is convert your CREATE TYPE statements to temp tables and split the MERGE out into separate INSERT / UPDATE / DELETE statements as required.
Example:
CREATE TABLE #DataTypeforCustomerTable (
PersonID INT,
Name VARCHAR(255),
LastModifytime DATETIME
)
WITH
(
DISTRIBUTION = HASH( PersonID ),
HEAP
)
GO
CREATE PROCEDURE usp_upsert_customer_table
AS
BEGIN
-- Add records which do not already exist
INSERT INTO customer_table ( PersonID, Name, LastModifytime )
SELECT PersonID, Name, LastModifytime
FROM #DataTypeforCustomerTable AS source
WHERE NOT EXISTS
(
SELECT *
FROM customer_table target
WHERE source.PersonID = target.PersonID
)
...
Simply load the temp table and execute the stored proc. See here for more details on temp table scope.
If you are altering a large portion of the table then you should consider the CTAS approach to create a new table, then rename it as suggested by Ron.