Azure Data Lake Gen2 as external table for Azure Data Explorer

We have CSV files in Azure Data Lake Gen2 under partitioned folders, so there are multiple CSV files for a single large table. We want to consume these files in Azure Data Explorer by creating an external table, so I am using the script below to create an external table in ADX:
.create external table TestAdx
(
id: int,
name: string,
designation: string
)
kind=adl
dataformat=csv
(
h@'abfss://containername@storageaccountname.dfs.core.windows.net/staging/textadx;token=<<generated using .NET API>>'
)
with
(
docstring = "Docs",
folder = "ExternalTables",
namePrefix="Prefix"
)
I am able to execute this query and the external table is created, but when I try to fetch data from this table it gives the error below:
Semantic error: 'TestAdx' has the following semantic error: ''
operator: Failed to resolve table or column or scalar expression named
'TestAdx'.
Also, please let me know whether this is the correct approach to work with ADLS Gen2 files from ADX.

What is the query you're running? Are you using the external_table() function?

You need to use external_table("TestAdx") to access the external table.
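For example, a minimal query against the external table defined above (keeping the TestAdx name from the question) would be:
// Reference the external table through the external_table() function
external_table("TestAdx")
| take 10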

Following is an example of creating an external table in Azure Data Explorer over Azure Data Lake Gen2. I have added the partition key and other parameters.
.create external table BugsCSV
(
Column1 : string,
Column2 : string,
Column3 : string
)
kind=adl
partition by "State="State
dataformat=csv
(
h@'abfss://containername@storageaccountname.dfs.core.windows.net/path;key'
)
with
(
docstring = "Docs",
folder = "ExternalTables",
compressed=true,
compressiontype="lz4"
)
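With the partition column in place, a filter on State (the value below is just an example) lets ADX read only the matching State=<value> folders instead of scanning every file:
// Partition pruning: only folders for the requested State are read
external_table("BugsCSV")
| where State == "CA"
| count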

Related

How to structure schemas better in Synapse SQL dedicated pool

Using Azure Synapse , Dedicated SQL pool.
How can I structure my tables underneath a button that represents the schema?
This is a small issue that really makes a big impact when many tables and schemas will be used in the database, and users will need to navigate to the correct schema quickly.
I tried dragging the schema over the tables section, but nothing worked.
To structure the tables, we first need to create a schema in the SQL dedicated pool using the command below:
CREATE SCHEMA <schemaName>
Then we need to create the table using the schema created above, with the required columns and suitable data types.
Table creation:
CREATE TABLE <tableName>(col1 dataType,col2 dataType)
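As a minimal sketch (the schema and table names below are just examples), creating the table with a schema-qualified name is what places it under that schema's node in Synapse Studio:
-- Example schema and table names; use whatever fits your model
CREATE SCHEMA sales;
GO
CREATE TABLE sales.customer
(
    customerId INT,
    customerName VARCHAR(100)
);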
I created an external table in the dedicated SQL pool in Synapse following the steps below:
Schema creation:
Created external data source:
CREATE EXTERNAL DATA SOURCE [DATASOURCE] WITH
(
LOCATION = '<location>'
)
Created an external file format:
CREATE EXTERNAL FILE FORMAT [FileFormat1] WITH
(
FORMAT_TYPE = DELIMITEDTEXT
)
I created the external table using the above data source and file format with the code below:
CREATE EXTERNAL TABLE [wwi].[information2]
(
[Id] INT
)
WITH
(
LOCATION = '<folder/file>',
DATA_SOURCE = [DATASOURCE1],
FILE_FORMAT = [FileFormat1]
)
In this way we can structure the tables in the Synapse dedicated SQL pool.

Azure Synapse: Cannot bulk load because the file could not be opened. Operating system error code 12(The access code is invalid.)

I am using Azure Synapse to query a large number of CSV files with the OPENROWSET command (see here). The files are located on a Data Lake Gen2 connected to Azure Synapse via a managed identity.
This works fine when I am only querying a few files at a time; however, when I increase the number of files I am trying to query simultaneously, I get the following error:
Azure Synapse: Cannot bulk load because the file <file> could not be opened. Operating system error code 12(The access code is invalid.)
Here <file> is a different file each time I run the query. If I navigate to the file in the linked data view I can download and view it. Also, if I run the query only on a file previously mentioned in an error, it works fine.
The code I am using to query the data lake is below:
SELECT
Parsed.*
FROM OPENROWSET
(
bulk '2021/*/**.log',
maxerrors = 2147483647,
data_source = 'analytics',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b'
) WITH (doc nvarchar(max)) AS Rows
CROSS APPLY OPENJSON(Rows.doc)
WITH
(
col1 NVARCHAR(100),
col2 NVARCHAR(100),
...,
coln NVARCHAR(MAX)
) AS Parsed
Here the data source, analytics, is specified as follows:
CREATE EXTERNAL DATA SOURCE analytics
WITH
(
location = 'https://<url>.dfs.core.windows.net/analytics'
)
I have tried specifying a large number for the MAXERRORS parameter for BULK in OPENROWSET, as I don't mind if a few files are missed in executing this query; however, this only seems to work at the row level, and these errors are at the file level.
The query is running here on the built-in serverless pool.
Any ideas on how to get around this issue would be appreciated.
Your code is doing pass-through authentication to storage for whatever AAD user is connected to Synapse serverless (which will fail if you are using a SQL login). To use the MSI to connect to storage, you will need a database scoped credential and will need to reference it in the external data source, as in this example.
-- Optional: Create MASTER KEY if not exists in database:
-- CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Very Strong Password>'
CREATE DATABASE SCOPED CREDENTIAL SynapseIdentity
WITH IDENTITY = 'Managed Identity';
GO
CREATE EXTERNAL DATA SOURCE mysample
WITH ( LOCATION = 'https://<storage_account>.dfs.core.windows.net/<container>/<path>',
CREDENTIAL = SynapseIdentity
)
Also see the sections in that article about the firewall on your storage account if you have that locked down.
Just adding a quick answer to this. After making the changes Greg suggested, I was able to query a little more data, but was still getting hit with error code 12.
I spoke with Azure support and they advised me that the error was actually 412 (but that was not visible to me), which implied a file in use/being modified. Adding the following allowed Azure Synapse to ignore this and query the file regardless:
ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
or for external tables:
TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
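As a rough sketch of where that option goes for an external table in a serverless SQL pool (the table, data source, and file format names below are hypothetical), it sits in the WITH clause:
-- Hypothetical names; TABLE_OPTIONS is the relevant part
CREATE EXTERNAL TABLE [dbo].[AnalyticsLogs]
(
    [doc] NVARCHAR(MAX)
)
WITH
(
    LOCATION = '/2021/',
    DATA_SOURCE = analytics,
    FILE_FORMAT = CsvFileFormat,
    TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
);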
This made my final query:
SELECT
Parsed.*
FROM OPENROWSET
(
bulk '2021/*/**.log',
maxerrors = 2147483647,
data_source = 'analytics_master_key',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b',
ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
) WITH (doc nvarchar(max)) AS Rows
CROSS APPLY OPENJSON(Rows.doc)
WITH
(
col1 NVARCHAR(100),
col2 NVARCHAR(100),
...,
coln NVARCHAR(MAX)
) AS Parsed

Load FileName in SQL Table using Copy Data Activity

I am a newbie to Azure Data Factory. I'm trying to load multiple files for various states from an FTP location into a single Azure SQL Server table. My requirement is to get the state name from the file name and dump it into the table along with the actual data.
Currently, my source is FTP and my sink is an Azure SQL Server table. I have used a Stored Procedure to load the data. However, I'm unable to send the file name as a parameter to the stored procedure so that I can dump it into the table. Below is the Copy Data component -
I have defined a SourceFileName parameter in the stored procedure; however, I am unable to send it via the Copy Data activity.
Any help is appreciated.
We can conclude that the Additional columns option cannot be used here, because ADF will return a column (containing the file path), not a string. So we need to use a Get Metadata activity to get the file list, then loop over that list with a ForEach activity and copy each file inside it.
I've created a simple test, and it works well.
On my local FTP server, there are two text files. I need to copy them into an Azure SQL table.
At the Get Metadata activity, I use Child Items to get the file list.
At the ForEach activity, I use @activity('Get Metadata1').output.childItems to iterate over the file list.
Inside the ForEach activity, I use the dynamic content @item().name to get the file name.
source setting:
sink setting:
So we can get the file name. Following are some operations I did on Azure SQL.
-- create a table
CREATE TABLE [dbo].[employee](
[firstName] [varchar](50) NULL,
[lastName] [varchar](50) NULL,
[filePath] [varchar](50) NULL
) ON [PRIMARY]
GO
-- create a table type
CREATE TYPE [dbo].[ct_employees_type] AS TABLE(
[firstName] [varchar](50) NULL,
[lastName] [varchar](50) NULL
)
GO
-- create a Stored procedure
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[spUpsertEmployees]
@employees ct_employees_type READONLY,
@filePath varchar(50)
AS
BEGIN
SET @filePath = SUBSTRING(@filePath, 1, LEN(@filePath) - 4)
MERGE [dbo].[employee] AS target_sqldb
USING @employees AS source_tblstg
ON (target_sqldb.firstName = source_tblstg.firstName)
WHEN MATCHED THEN
UPDATE SET
firstName = source_tblstg.firstName,
lastName = source_tblstg.lastName
WHEN NOT MATCHED THEN
INSERT (
firstName,
lastName,
filePath
)
VALUES (
source_tblstg.firstName,
source_tblstg.lastName,
@filePath
);
END
GO
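As a quick manual sanity check of the procedure (the sample values below are made up), you can call it with a table-valued parameter directly from SQL:
-- Build a table-valued parameter and call the stored procedure
DECLARE @emp dbo.ct_employees_type;
INSERT INTO @emp (firstName, lastName) VALUES ('John', 'Doe');
EXEC dbo.spUpsertEmployees @employees = @emp, @filePath = 'NewYork.txt';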
After I run debug, the result is as follows:

How to automatically sync a Hive external table with a MySQL table without using Sqoop?

I already have a MySQL table on my local machine (Linux), and I have a Hive external table with the same schema as the MySQL table.
I want to sync my Hive external table whenever a new record is inserted or updated. A batch update is OK with me, say hourly.
What is the best possible approach to achieve this without using Sqoop?
Thanks,
Sumit
Without Sqoop, you can create a table STORED BY JdbcStorageHandler. Project repository: https://github.com/qubole/Hive-JDBC-Storage-Handler It will work like a usual Hive table, but queries will run against MySQL. Predicate pushdown will work.
DROP TABLE HiveTable;
CREATE EXTERNAL TABLE HiveTable(
id INT,
id_double DOUBLE,
names STRING,
test INT
)
STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
TBLPROPERTIES (
"mapred.jdbc.driver.class"="com.mysql.jdbc.Driver",
"mapred.jdbc.url"="jdbc:mysql://localhost:3306/rstore",
"mapred.jdbc.username"="root",
"mapred.jdbc.input.table.name"="JDBCTable",
"mapred.jdbc.output.table.name"="JDBCTable",
"mapred.jdbc.password"="",
"mapred.jdbc.hive.lazy.split"= "false"
);
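For instance, a filter query like the one below (values are just examples; column names come from the table definition above) runs as normal HiveQL, with the WHERE clause pushed down to MySQL:
-- The id predicate is pushed down to the MySQL side by the storage handler
SELECT names, test
FROM HiveTable
WHERE id = 100;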

How to create an external table in Azure without the location path existing

Is there any way to create an external table in Azure SQL DWH even though the location path mentioned in the external table statement doesn't exist?
For e.g.: the location '/src/temp' doesn't exist, but I still want the external table to be created.
create external table ext.dummy(
PERSON_ID varchar(500) ,
ASSIGNMENT_ID varchar(500)
)
WITH
(
LOCATION='/src/temp',
DATA_SOURCE = YasCdpBlobStorage,
FILE_FORMAT = ExtTableTextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
No, this is not possible. The external location, whether it is a folder or a file path, must exist before you create the external table.
Although the documentation does not state this explicitly, it is implied by the term "actual", i.e.
LOCATION = 'folder_or_filepath'
Specifies the folder or the file path
and file name for the actual data in Hadoop or Azure blob storage.
