Azure Synapse: Cannot bulk load because the file could not be opened. Operating system error code 12 (The access code is invalid.)

I am using Azure Synapse to query a large number of CSV files with the OPENROWSET command (see here). The files are located on a Data Lake Gen2 storage account connected to Azure Synapse via a managed identity.
This works fine when I am only querying a few files at a time; however, when I increase the number of files I am trying to query simultaneously, I get the following error:
Azure Synapse: Cannot bulk load because the file <file> could not be opened. Operating system error code 12(The access code is invalid.)
Here <file> is a different file each time I run the query. If I navigate to the file in the linked data view, I can download and view it. Also, if I run the query only on the file mentioned in a previous error, it works fine.
The code I am using to query the data lake is below:
SELECT
Parsed.*
FROM OPENROWSET
(
bulk '2021/*/**.log',
maxerrors = 2147483647,
data_source = 'analytics',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b'
) WITH (doc nvarchar(max)) AS Rows
CROSS APPLY OPENJSON(Rows.doc)
WITH
(
col1 NVARCHAR(100),
col2 NVARCHAR(100),
...,
coln NVARCHAR(MAX)
) AS Parsed
Here the data source analytics is defined as follows:
CREATE EXTERNAL DATA SOURCE analytics
WITH
(
location = 'https://<url>.dfs.core.windows.net/analytics'
)
I have tried specifying a large number for the MAXERRORS parameter of OPENROWSET, as I don't mind if a few files are missed when executing this query; however, this only seems to apply to row-level errors, and these errors are at the file level.
The query is running here on the built-in serverless pool.
Any ideas on how to get around this issue would be appreciated.

Your code is doing pass-through authentication to storage for whatever AAD user is connected to Synapse serverless (which will fail if you are using a SQL login). To use the MSI to connect to storage, you will need a database scoped credential and will need to reference it in the external data source, as in this example.
-- Optional: create a MASTER KEY if one does not already exist in the database:
-- CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Very Strong Password>';
CREATE DATABASE SCOPED CREDENTIAL SynapseIdentity
WITH IDENTITY = 'Managed Identity';
GO
CREATE EXTERNAL DATA SOURCE mysample
WITH ( LOCATION = 'https://<storage_account>.dfs.core.windows.net/<container>/<path>',
CREDENTIAL = SynapseIdentity
)
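A minimal usage sketch against that data source (the path and parser options here are placeholders, not from your setup):
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'some/folder/*.csv',       -- placeholder path, relative to the data source LOCATION
    DATA_SOURCE = 'mysample',       -- the data source above, backed by the managed identity credential
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) AS rows;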
Also see the sections in that article about the firewall on your storage account if you have that locked down.

Just adding a quick answer to this. After making the changes Greg suggested, I was able to query a little more data, but I was still getting hit with error code 12.
I spoke with Azure support and they advised me that the error code was actually 412 (though that was not visible to me), which implied the file was in use or being modified. Adding the following allowed Azure Synapse to ignore this and query the file regardless:
ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
or for external tables:
TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
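For reference, a rough sketch of where TABLE_OPTIONS would sit in an external table definition (I did not use an external table myself, so the table name, location and file format below are illustrative only):
CREATE EXTERNAL TABLE dbo.AnalyticsLogs
(
    doc NVARCHAR(MAX)
)
WITH
(
    LOCATION = '2021/',                  -- illustrative folder path
    DATA_SOURCE = analytics_master_key,
    FILE_FORMAT = CsvLogFormat,          -- assumes an existing EXTERNAL FILE FORMAT with matching delimiters
    TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
);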
This made my final query:
SELECT
Parsed.*
FROM OPENROWSET
(
bulk '2021/*/**.log',
maxerrors = 2147483647,
data_source = 'analytics_master_key',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b',
ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
) WITH (doc nvarchar(max)) AS Rows
CROSS APPLY OPENJSON(Rows.doc)
WITH
(
col1 NVARCHAR(100),
col2 NVARCHAR(100),
...,
coln NVARCHAR(MAX)
) AS Parsed

Related

How to structure schemas better in Synapse SQL dedicated pool

Using Azure Synapse, dedicated SQL pool.
How can I structure my tables underneath a node that represents the schema?
This is a small issue that makes a big impact when many tables and schemas are used in the database and users need to navigate to the correct schema quickly.
I tried dragging the schema over the tables section, but nothing worked.
To structure the tables, first we need to create a schema in the SQL dedicated pool using the command below:
CREATE SCHEMA <schemaName>
Then we need to create the table under that schema, with the required columns and suitable data types for each column.
Table creation:
CREATE TABLE <schemaName>.<tableName> (col1 dataType, col2 dataType)
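For instance, a hypothetical sales schema with one table could look like this (names, columns and distribution are illustrative only):
CREATE SCHEMA sales;
GO

-- Creating the table inside the schema groups it under that schema node in Synapse Studio
CREATE TABLE sales.Orders
(
    OrderId   INT,
    OrderDate DATE,
    Amount    DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);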
I created an external table in the dedicated SQL pool in Synapse following the steps below:
Schema creation:
Created external data source:
CREATE EXTERNAL DATA SOURCE [DATASOURCE] WITH
(
LOCATION = '<location>'
)
Created external file format:
CREATE EXTERNAL FILE FORMAT [FileFormat1] WITH
(
FORMAT_TYPE = DELIMITEDTEXT
)
I created an external table using the above data source and file format with the code below:
CREATE EXTERNAL TABLE [wwi].[information2]
(
[Id] INT
)
WITH
(
LOCATION = '<folder/file>',
DATA_SOURCE = [DATASOURCE],
FILE_FORMAT = [FileFormat1]
)
In this way we can structure tables under schemas in the Synapse dedicated SQL pool.

Handling partitioned data in Azure?

I have some containers in ADLS (Gen2), each with multiple folders inside. I would like a mechanism to scan those folders, infer their schema, detect partitions, and update them in the data catalog. How do I achieve this in Azure?
Sample:
- container1
---table1-folder
-----10-12-1970
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----10-13-1970
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----10-14-1970
-------files1.parquet
-------files2.parquet
----table2-folder
-----zipcode1
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----zipcode2
-------files1.parquet
-------files2.parquet
...
So, what I expect is that the catalog will create two tables (table1 & table2), where table1 will have date-based partitions (3 dates in this case) with the underlying data within that table. Same for table2, which will have two partitions and their underlying data.
In the AWS world, I can run a Glue crawler that crawls these files, infers schemas and partitions, and populates the Glue data catalog; later I can query them through Athena. What's the Azure equivalent approach to achieve something similar?
I would recommend looking at Azure Synapse Analytics Serverless SQL. You can create a view which consumes the folders and does partition elimination if you follow this approach:
-- If you do not have a Master Key on your DW you will need to create one
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>' ;
GO
CREATE DATABASE SCOPED CREDENTIAL msi_cred
WITH IDENTITY = 'Managed Service Identity' ;
GO
CREATE EXTERNAL DATA SOURCE ds_container1
WITH
( TYPE = HADOOP ,
LOCATION = 'abfss://container1@mystorageaccount.dfs.core.windows.net',
CREDENTIAL = msi_cred
) ;
GO
CREATE VIEW Table2
AS SELECT *, f.filepath(1) AS [zipcode]
FROM
OPENROWSET(
BULK 'table2-folder/*/*.parquet',
DATA_SOURCE = 'ds_container1',
FORMAT='PARQUET'
) AS f
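As a usage sketch (the folder name zipcode1 is taken from the sample layout in the question), filtering on the filepath-derived column is what lets serverless SQL skip the non-matching folders:
-- Only the files under table2-folder/zipcode1/ should be read for this query
SELECT COUNT(*) AS row_count
FROM Table2
WHERE zipcode = 'zipcode1';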
Then set up Azure Purview as your data catalog and have it index your Synapse serverless SQL pool.

Azure Data Factory Error: "incorrect syntax near"

I'm trying to do a simple incremental update from an on-prem database (source) to an Azure SQL database (sink), based on a varchar column called "RP" in the on-prem database that contains "date+staticdescription", for example: "20210314MetroFactory".
1. I've created a Lookup activity called Lookup1 that uses a table created in the Azure SQL database and this query:
"Select RP from SubsetwatermarkTable"
2. I've created a Copy Data activity where the source settings use this query:
"Select * from SourceDevSubsetTable WHERE RP NOT IN '#{activity('Lookup1').output.value}'"
When debugging, I'm getting the error:
Failure type: User configuration issue
Details: Failure happened on 'Source' side.
'Type=System.Data.SqlClient.SqlException,Message=Incorrect syntax near
'[{"RP":"20210307_1Plant
1KAO"},{"RP":"20210314MetroFactory"},{"RP":"20210312MetroFactory"},{"RP":"20210312MetroFactory"},{"RP":"2'.,Source=.Net
SqlClient Data
Provider,SqlErrorNumber=102,Class=15,ErrorCode=-2146232060,State=1,Errors=[{Class=15,Number=102,State=1,Message=Incorrect
syntax near
'[{"RP":"20210311MetroFactory"},{"RP":"20210311MetroFactory"},{"RP":"202103140MetroFactory"},{"RP":"20210308MetroFactory"},{"RP":"2'.,},],'
Can anyone tell me what I am doing wrong and how to fix it, even if it requires creating more activities?
Note: there is no LastModifiedDate column in the table. Also, I haven't yet created the stored procedure that will update the Lookup table once the incremental copy is done.
Steve is right about why it is failing and about the query you need in the Copy Data activity.
As he says, you want a comma-separated list of quoted values to use in your IN clause.
You can get this more easily, though, directly from your Lookup using this query:
select stuff(
(
select ','''+rp+''''
from subsetwatermarktable
for xml path('')
)
, 1, 1, ''
) as in_clause
The sub-query builds the comma-separated list with quotes around each rp value, but with a spurious comma at the start; the outer query's STUFF removes it.
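A quick self-contained illustration of what that returns (the throwaway table and values here are made up):
-- Demo only: simulate two RP values and build the quoted, comma-separated list
CREATE TABLE #demo (rp VARCHAR(50));
INSERT INTO #demo VALUES ('20210314MetroFactory'), ('20210311MetroFactory');

SELECT STUFF(
    ( SELECT ',''' + rp + '''' FROM #demo FOR XML PATH('') ),
    1, 1, ''
) AS in_clause;
-- Returns: '20210314MetroFactory','20210311MetroFactory'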
Now tick the First Row Only box on the Lookup and change your Copy Data source query to:
select *
from SourceDevSubsetTable
where rp not in (@{activity('Lookup1').output.firstRow.in_clause})
The result of @activity('Lookup1').output.value is an array, as your error shows:
[{"RP":"20210307_1Plant
1KAO"},{"RP":"20210314MetroFactory"},{"RP":"20210312MetroFactory"},{"RP":"20210312MetroFactory"},{"RP":"2'.,Source=.Net
SqlClient Data
Provider,SqlErrorNumber=102,Class=15,ErrorCode=-2146232060,State=1,Errors=[{Class=15,Number=102,State=1,Message=Incorrect
syntax near
'[{"RP":"20210311MetroFactory"},{"RP":"20210311MetroFactory"},{"RP":"202103140MetroFactory"},{"RP":"20210308MetroFactory"},{"RP":"2'.,},]
However, your SQL needs to look like this: Select * from SourceDevSubsetTable WHERE RP NOT IN ('20210307_1Plant 1KAO','20210314MetroFactory',...).
To achieve this in ADF, you need to do something like this:
1. Create three variables: an array variable arrayvalues, and string variables apostrophe (holding a single quote character) and stringvalues.
2. Loop over the result of @activity('Lookup1').output.value with a ForEach activity and, inside it, append the quoted item().RP to arrayvalues with an Append variable activity:
ForEach items expression: @activity('Lookup1').output.value
Append value expression: @concat(variables('apostrophe'),item().RP,variables('apostrophe'))
3. Cast arrayvalues to a string and add parentheses with a Set variable activity:
expression: @concat('(',join(variables('arrayvalues'),','),')')
4. Copy to your Azure SQL database with this source query:
expression: Select * from SourceDevSubsetTable WHERE RP NOT IN @{variables('stringvalues')}

Azure Datalake Gen2 as external table for Azure Data Explorer

We have CSV files in Azure Data Lake Gen2 under partitioned folders, so there will be multiple CSV files for a single large table. We want to consume these files in Azure Data Explorer by creating an external table, so I am using the script below to create the external table in ADX:
.create external table TestAdx
(
id: int,
name: string,
designation: string
)
kind=adl
dataformat=csv
(
h@'abfss://containername@storageaccountname.dfs.core.windows.net/staging/textadx;token=<<generating using .net API>>'
)
with
(
docstring = "Docs",
folder = "ExternalTables",
namePrefix="Prefix"
)
I am able to execute this query and the external table is created, but when I try to fetch data from the table I get the error below:
Semantic error: 'TestAdx' has the following semantic error: ''
operator: Failed to resolve table or column or scalar expression named
'TestAdx'.
Also, please let me know whether this is the correct approach to work with ADLS Gen2 files from ADX.
What is the query you're running? Are you using the external_table() function?
You need to use external_table("TestAdx") to access the external table.
Following is an example of creating an external table in Azure Data Explorer over Azure Data Lake Gen2. I have added the partition key and other parameters.
.create external table BugsCSV
(
Column1 : string,
Column2 : string,
Column3 : string
)
kind=adl
partition by "State="State
dataformat=csv
(
h@'abfss://containername@storageaccountname.dfs.core.windows.net/path;key'
)
with
(
docstring = "Docs",
folder = "ExternalTables",
compressed=true,
compressiontype="lz4"
)

Is there an alternative to CREATE TYPE in SQL, as CREATE TYPE is not supported in Azure SQL Data Warehouse?

I am trying to execute this query, but user-defined table types (CREATE TYPE) are not supported in Azure SQL Data Warehouse, and I want to use one in a stored procedure.
CREATE TYPE DataTypeforCustomerTable AS TABLE(
PersonID int,
Name varchar(255),
LastModifytime datetime
);
GO
CREATE PROCEDURE usp_upsert_customer_table @customer_table DataTypeforCustomerTable READONLY
AS
BEGIN
MERGE customer_table AS target
USING @customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
UPDATE SET Name = source.Name,LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
INSERT (PersonID, Name, LastModifytime)
VALUES (source.PersonID, source.Name, source.LastModifytime);
END
GO
CREATE TYPE DataTypeforProjectTable AS TABLE(
Project varchar(255),
Creationtime datetime
);
GO
CREATE PROCEDURE usp_upsert_project_table @project_table DataTypeforProjectTable READONLY
AS
BEGIN
MERGE project_table AS target
USING @project_table AS source
ON (target.Project = source.Project)
WHEN MATCHED THEN
UPDATE SET Creationtime = source.Creationtime
WHEN NOT MATCHED THEN
INSERT (Project, Creationtime)
VALUES (source.Project, source.Creationtime);
END
Is there any alternative way to do this?
You've got a few challenges there, because most of what you're trying to convert is not the way to do things on ASDW.
First, as you point out, CREATE TYPE is not supported, and there is no equivalent alternative.
Next, the code appears to be doing single inserts to a table. That's really bad on ASDW; performance will be dreadful.
Next, there's no MERGE statement (yet) for ASDW. That's because UPDATE is not the best way to handle changing data.
And last, stored procedures work a little differently on ASDW, they're not compiled, but interpreted each time the procedure is called. Stored procedures are great for big chunks of table-level logic, but not recommended for high volume calls with single-row operations.
I'd need to know more about the use case to make specific recommendations, but in general you need to think in tables rather than rows. In particular, focus on the CREATE TABLE AS (CTAS) way of handling your ELT.
Here's a good link; it shows how the equivalent of a MERGE/upsert can be handled using a CTAS:
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-develop-ctas#replace-merge-statements
As you'll see, it processes two tables at a time, rather than one row. This means you'll need to review the logic that called your stored procedure example.
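As a rough sketch of that CTAS pattern using the names from the question (customer_table_stage is a hypothetical staging table holding the incoming rows; treat this as illustrative, not a drop-in replacement):
-- Rebuild the target: incoming rows win, untouched existing rows are carried over
CREATE TABLE dbo.customer_table_new
WITH ( DISTRIBUTION = HASH(PersonID), CLUSTERED COLUMNSTORE INDEX )
AS
SELECT s.PersonID, s.Name, s.LastModifytime
FROM dbo.customer_table_stage AS s
UNION ALL
SELECT t.PersonID, t.Name, t.LastModifytime
FROM dbo.customer_table AS t
WHERE NOT EXISTS (SELECT 1 FROM dbo.customer_table_stage AS s WHERE s.PersonID = t.PersonID);

-- Swap the rebuilt table in
RENAME OBJECT dbo.customer_table TO customer_table_old;
RENAME OBJECT dbo.customer_table_new TO customer_table;
DROP TABLE dbo.customer_table_old;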
If you get your head around doing everything in CTAS, and separately around Distribution, you're well on your way to having a high performance data warehouse.
Temp tables in Azure SQL Data Warehouse have slightly different behaviour from the box-product SQL Server or Azure SQL Database: they exist at the session level. So all you have to do is convert your CREATE TYPE statements to temp tables and split the MERGE out into separate INSERT / UPDATE / DELETE statements as required.
Example:
CREATE TABLE #DataTypeforCustomerTable (
PersonID INT,
Name VARCHAR(255),
LastModifytime DATETIME
)
WITH
(
DISTRIBUTION = HASH( PersonID ),
HEAP
)
GO
CREATE PROCEDURE usp_upsert_customer_table
AS
BEGIN
-- Add records which do not already exist
INSERT INTO customer_table ( PersonID, Name, LastModifytime )
SELECT PersonID, Name, LastModifytime
FROM #DataTypeforCustomerTable AS source
WHERE NOT EXISTS
(
SELECT *
FROM customer_table target
WHERE source.PersonID = target.PersonID
)
...
Simply load the temp table and execute the stored proc. See here for more details on temp table scope.
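A minimal usage sketch (the sample row is made up):
-- Load the session-scoped temp table, then call the proc in the same session
INSERT INTO #DataTypeforCustomerTable (PersonID, Name, LastModifytime)
VALUES (1, 'Example Person', GETDATE());

EXEC usp_upsert_customer_table;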
If you are altering a large portion of the table then you should consider the CTAS approach to create a new table, then rename it as suggested by Ron.
