Azure Synapse Serverless SQL Pool - Error while selecting from quoted fields

I am getting an error while selecting from a CSV file that contains quoted fields using a Serverless/On-Demand SQL Pool in Azure Synapse. The data contains the field terminator (,) within some fields, but those fields are quoted with double quotes. I have even tried specifying FIELDQUOTE explicitly, even though I am using the default quote character in the data.
My file contains the following data:
"number", "text"
1, "one"
2, "two"
11, "one, one"
12, "one, two"
The SQL that I ran is as follows:
SELECT
    *
FROM
    OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/data/test_quoted_fields.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIELDQUOTE = '"',
        FIELDTERMINATOR = ',',
        HEADER_ROW = TRUE
    ) AS [result]
And the error message is as follows:
Error handling external file: 'Quotes '' must be inside quoted fields at [byte: 10]. '. File/External table name: 'https://mydatalake.dfs.core.windows.net/data/test_quoted_fields.csv'
Please note that I am running the query using a Serverless/On-Demand SQL Pool.
Can someone help, please? Thanks
Data in the ADLS portal, in Edit mode and in Preview mode (screenshots omitted).

It appears that saving the .csv in Excel and then uploading it to ADLS adds additional quotes ("") around the quoted fields so that they are seen as strings, and with that it does seem to work well.

Related

Write DDL to .sql file using Pandas

I am trying to extract the DDL of tables and store it in .sql files using pandas.
The code I have tried is:
query = "show table tablename"
df = pd.read_sql(query, connect)
df.to_csv('xyz.sql', index=False, header=False, quoting=None)
This creates a .sql file with the DDL like this -
" CREATE TABLE .....
.... ; "
How do I write the file without the quotes, like -
CREATE TABLE .....
.... ;
Given a string s, such as "CREATE ...", one can delete double-quote characters with:
s = s.replace('"', '')
And don't forget maketrans, which (together with translate) is very good at efficiently deleting unwanted characters from very long strings.
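A minimal sketch of the maketrans/translate approach (the example DDL string and the output file name below are made up for illustration):

# Translation table that deletes the double-quote character.
drop_quotes = str.maketrans('', '', '"')

ddl = '" CREATE TABLE t (id INT) ; "'        # string wrapped in unwanted quotes
clean = ddl.translate(drop_quotes).strip()   # -> 'CREATE TABLE t (id INT) ;'

with open('xyz.sql', 'w') as f:              # write the DDL without the quotes
    f.write(clean + '\n')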

Snowflake operator (airflow) not rendering SQL properly

I am trying to run a simple SQL query using the Airflow Snowflake provider (1.3.0):
SnowflakeOperator(
    task_id=f'task',
    snowflake_conn_id='snowflake_conn',
    parameters={
        "table": "dummy_table",
    },
    sql=["delete from %(table)s"],
    autocommit=True,
    dag=dag,
)
The SQL it is rendering is delete from 'dummy'. I want to get rid of the quotes, but I have tried everything and nothing seems to work.
To parametrize the table name, IDENTIFIER should be used:
To use an object name specified in a literal or variable, use IDENTIFIER().
sql=["delete from IDENTIFIER(%(table)s)"],
The query DELETE FROM 'dummy' is not correct, but DELETE FROM IDENTIFIER('dummy') will work.
CREATE TABLE dummy(id INT);
DELETE FROM 'dummy';
-- Syntax error: unexpected ''dummy''. (line 4)
DELETE FROM IDENTIFIER('dummy');
-- number of rows deleted: 0
If you are using parameters then it's up to SQLAlchemy. You can find more information about it in How to render a .sql file with parameters in MySqlOperator in Airflow?
Alternatively, you can use Airflow rendering (Jinja engine) with params:
SnowflakeOperator(
    task_id=f'task',
    snowflake_conn_id='snowflake_conn',
    params={
        "table": "dummy_table",
    },
    sql=["delete from {{ params.table }}"],
    autocommit=True,
    dag=dag,
)
This will be rendered by the Jinja engine, and thus the query that will be submitted to Snowflake is:
delete from dummy_table

how to concatenate multiple row or column data into one row or column from a text file while importing data into db2 table

For Eg:
1)File has
ID|Name|job|hobby|salary|hobby2
2)Data:
1|ram|architect|tennis|20000|cricket
1|ram|architect|football|20000|gardening
2|krish|teacher|painting|25000|cooking
3)Table:
Columns in table: ID-Name-Job-Hobby-Salary
Is it possible to load data into table as below:
1-ram-architect-tenniscricketfootbalgardening-20000
2-krish-teacher-paintingcooking-25000
Command: db2 "Load CLIENT FROM ABC.FILE of DEL MODIFIED BY coldel0x7x keepblanks REPLACE INTO tablename(ID,Name,Job,Hobby,salary) nonrecoverable"
You cannot achieve what you think you want in a single action with either LOAD CLIENT or IMPORT.
You are asking to denormalize, and I presume you understand the consequences.
Regardless, you can use a multi-step approach: first load/import into a temporary (staging) table, then in a second step use SQL to denormalize into the final table (a rough sketch of that step is given below, after the alternatives), and finally discard the staging table.
Or, if you are adept with awk and the data file is correctly sorted, you can pre-process the file outside the database before the load/import.
Or use an ETL tool.
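As a rough sketch of that second, denormalizing step, assuming the raw rows were loaded into a staging table named STG_EMP with the same six columns as the file (the staging table name and the LISTAGG-based concatenation are illustrative assumptions, not the only way to do it):

-- Collapse the staging rows per ID and concatenate the hobby columns.
INSERT INTO tablename (ID, Name, Job, Hobby, Salary)
SELECT ID,
       MIN(Name),
       MIN(Job),
       LISTAGG(TRIM(Hobby) CONCAT TRIM(Hobby2), '') WITHIN GROUP (ORDER BY Hobby),
       MIN(Salary)
FROM STG_EMP
GROUP BY ID;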
You may use the INGEST command instead of LOAD.
You must create the corresponding infrastructure for this command beforehand with the following command, for example:
CALL SYSINSTALLOBJECTS('INGEST', 'C', 'USERSPACE1', NULL);
Load your file afterwards with the following command:
INGEST FROM FILE ABC.FILE
FORMAT DELIMITED by '|'
(
$id INTEGER EXTERNAL
, $name CHAR(8)
, $job CHAR(20)
, $hobby CHAR(20)
, $salary INTEGER EXTERNAL
, $hobby2 CHAR(20)
)
MERGE INTO tablename
ON ID = $id
WHEN MATCHED THEN
UPDATE SET hobby = hobby CONCAT $hobby CONCAT $hobby2
WHEN NOT MATCHED THEN
INSERT (ID, NAME, JOB, HOBBY, SALARY) VALUES($id, $name, $job, $hobby CONCAT $hobby2, $salary);

Kusto/Azure Data Explorer - How can I partition an external table using a timespan field?

Hoping someone can help..
I am new to Kusto and need to get an external table that reads data from an Azure Blob storage account working, but the one table I have is unique in that the data for the timestamp column is split into 2 separate columns, i.e. LogDate and LogTime (see script below).
My data is stored in the following structure in the Azure Storage account container (container is named "employeedata", for example):
{employeename}/{year}/{month}/{day}/{hour}/{minute}.csv, in a simple CSV format.
I know the CSV is good because if I import it into a normal Kusto table, it works perfectly.
My KQL script for the external table creation looks as follows:
.create-or-alter external table EmpLogs (Employee: string, LogDate: datetime, LogTime:timestamp)
kind=blob
partition by (EmployeeName:string = Employee, yyyy:datetime = startofday(LogDate), MM:datetime = startofday(LogDate), dd:datetime = startofday(LogDate), HH:datetime = todatetime(LogTime), mm:datetime = todatetime(LogTime))
pathformat = (EmployeeName "/" datetime_pattern("yyyy", yyyy) "/" datetime_pattern("MM", MM) "/" datetime_pattern("dd", dd) "/" substring(HH, 0, 2) "/" substring(mm, 3, 2) ".csv")
dataformat=csv
(
h#'************************'
)
with (folder="EmployeeInfo", includeHeaders="All")
I am getting the error below constantly, which is not very helpful (redacted from the full error; it basically comes down to a syntax error somewhere):
Syntax error: Query could not be parsed: {
"error": {
"code": "BadRequest_SyntaxError",
"message": "Request is invalid and cannot be executed.",
"#type": "Kusto.Data.Exceptions.SyntaxException",
"#message": "Syntax error: Query could not be parsed: . Query: '.create-or-alter external table ........
I know the todatetime() function works on timespans; I tested it with another table and it created a date similar to the following: 0001-01-01 20:18:00.0000000.
I have tried using the bin() function on the timestamp/LogTime column, but I get the same error as above. I even tried importing the time value as a string and doing some string manipulation on it, with no luck; I get the same syntax error.
Any help/guidance would be greatly appreciated.
Thank you!!
Currently, there's no way to define an external table partition based on more than one column. If your dataset's timestamp is split between two columns, LogDate:datetime and LogTime:timespan, then the best you can do is use a virtual column for the partition by time:
.create-or-alter external table EmpLogs(Employee: string, LogDate:datetime, LogTime:timespan)
kind=blob
partition by (EmployeeName:string = Employee, PartitionDate:datetime)
pathformat = (EmployeeName "/" datetime_pattern("yyyy/MM/dd/HH/mm", PartitionDate))
dataformat=csv
(
//h#'************************'
)
with (folder="EmployeeInfo", includeHeaders="All")
Now you can filter by the virtual column and fine-tune using LogTime:
external_table("EmpLogs")
| where Employee in ("John Doe", ...)
| where PartitionDate between(datetime(2020-01-01 10:00:00) .. datetime(2020-01-01 11:00:00))
| where LogTime ...

Load connecting tables from Cassandra in QlikView with DataSatx ODBC

I am new to both Cassandra (2.0) and QlikView (11).
I have two keyspaces (tables) with large amounts of data in Cassandra and I want to load them into QlikView.
Since I cannot load the entire set, filtering is necessary.
// In QlikView's edit script
ODBC CONNECT TO [DataStax Cassandra ODBC DSN64];
LOAD idsession,
logintime,
"h_id" as hid;
SQL SELECT *
FROM Cassandra.test.sessions
WHERE logintime > '2015-06-09'
ALLOW FILTERING;
LOAD idhost,
site;
SQL SELECT *
FROM Cassandra.test.hosts
WHERE idhost in hid;
The second query does not work; QlikView reports the error line 3:16 no viable alternative at input 'hid'.
My question: is it possible to get the h_ids from the first query and only collect the corresponding entities from the second table?
I assume that you can't do an Exists in the DataStax ODBC, which might otherwise have helped (DataStax doc).
This could be done with an external program (e.g. in C#), but I really want to do this in QlikView's script file:
// Not complete code
query = "select * from sessions where loginTime > '2015-06-09'";
foreach (var id in query) {
    query2 = "select * from hosts where idhost = " + id;
}
EDIT
This can be solved when loading data from files (the example below uses Excel files):
TableA:
LOAD id,
itema
FROM
[C:\test1data.xlsx]
(ooxml, embedded labels);
TableB:
LOAD idb,
itemb,
ida
FROM
[C:\test2data.xlsx]
(ooxml, embedded labels) where(Exists (id,ida));
EDIT2
Besides the great answer from @i_saw_drones, another solution is to loop through the ids.
For i = 1 to NoOfRows('Sessions')
    Let cur_id = Peek('hid', i - 1, 'Sessions');
    LOAD
        idhost,
        site;
    SQL SELECT *
    FROM Cassandra.test.hosts
    WHERE idhost = $(cur_id);
NEXT i
Nevertheless, the performance was not great. It took about 30 minutes to load around 300K rows from Cassandra. The same queries were tested in a C# program with the connector and took 9 seconds, but that was just the query; you would then have to write the result to XML and load that into QlikView.
The reason the second query fails is that the WHERE clause expects a literal list of string values to look "in". For example:
LOAD
idhost,
site;
SQL SELECT *
FROM Cassandra.test.hosts
WHERE idhost in ('ID1', 'ID2', 'ID3', 'ID4');
The hid field returned by the first query is a QlikView list and as such cannot be immediately coerced into a string. We have to do a little more scripting to obtain a list of values from the first query in literal form, and then add that to the second query as part of the WHERE clause. The easiest way to do this is to concatenate all of your hids into a string and then use the string as part of your WHERE IN clause.
ODBC CONNECT TO [DataStax Cassandra ODBC DSN64];
MyData:
LOAD
idsession,
logintime,
"h_id" as hid;
SQL SELECT *
FROM Cassandra.test.sessions
WHERE logintime > '2015-06-09'
ALLOW FILTERING;
hid_entries:
LOAD
chr(39) & hids & chr(39) as hids;
LOAD
concat(hid, chr(39) & ',' & chr(39)) as hids;
LOAD DISTINCT
hid
RESIDENT MyData;
LET hid_values = '(' & peek('hids',0,'hid_entries') & ')';
DROP TABLE hid_entries;
LOAD
idhost,
site;
SQL SELECT *
FROM Cassandra.test.hosts
WHERE idhost in $(hid_values);
