Azure Data Factory - Executing Mathematical Operation from column value - azure

I am new to Azure Data Factory and have searched everywhere for a solution that fits my needs, but I haven't found one.
My Problem:
I have a table in Azure Database with a column containing a mathematical operation, about 50 columns containing the variables for the operation and one last column where I need to update the result of the mathematical operation, like this:
Example of the table
What I want to do is fill the column "result" with the result of the mathematical operation contained in the column "Operation", using the other columns' values in the expression. This is just an example table; my actual table has about 50 columns of values, so using a "replace" operation is not a solution for me.

There are probably a few ways to do this but I would not use Data Factory, unless you need to orchestrate this activity as part of a wider pipeline. As you have some compute handy via Azure SQL Database, I would make best use of that unless you have a specific reason not to do so. T-SQL has dynamic SQL and the EXEC command to help. Use a cursor to run through the distinct list of formulas and execute it dynamically. A simplified example:
DROP TABLE IF EXISTS dbo.formulas;
CREATE TABLE dbo.formulas (
Id INT PRIMARY KEY,
formula VARCHAR(100) NOT NULL,
a INT NOT NULL,
b INT NOT NULL,
c INT NOT NULL,
d INT NOT NULL,
e INT NOT NULL,
--...
result INT
);
-- Set up test data
INSERT INTO dbo.formulas ( Id, formula, a, b, c, d, e )
VALUES
( 1, '(a+b)/d', 1, 20, 2, 3, 1 ),
( 2, '(c+b)*(a+e)', 0, 1, 2, 3, 4 ),
( 3, 'a*(d+e+c)', 7, 10, 6, 2, 1 )
SET NOCOUNT ON
-- Create local fast_forward ( forward-only, read-only ) cursor
-- Get the distinct formulas for the table
DECLARE formulaCursor CURSOR FAST_FORWARD LOCAL FOR
SELECT DISTINCT formula
FROM dbo.formulas
-- Cursor variables
DECLARE @sql NVARCHAR(MAX)
DECLARE @formula NVARCHAR(100)
OPEN formulaCursor
FETCH NEXT FROM formulaCursor INTO @formula
WHILE @@fetch_status = 0
BEGIN
SET @sql = 'UPDATE dbo.formulas
SET result = ' + @formula + '
--OUTPUT inserted.id -- optionally output updated ids
WHERE formula = ''' + @formula + ''';'
PRINT @sql
-- Update each result field for the current formula
EXEC(@sql)
FETCH NEXT FROM formulaCursor INTO @formula
END
CLOSE formulaCursor
DEALLOCATE formulaCursor
GO
SET NOCOUNT OFF
GO
-- Check the results
SELECT *
FROM dbo.formulas;
Cursors have a bad reputation for performance, but i) here I'm only iterating over the distinct list of formulas and ii) sometimes a cursor is the only way. I can't think of a nice set-based way of doing this - happy to be corrected. CLR is not available to you. If performance is a major issue for you, you may need to think about alternatives; there's an interesting discussion on a similar problem here.
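For anyone who wants to try the loop-over-distinct-formulas idea without an Azure SQL Database at hand, here is a rough sketch using Python's built-in sqlite3 module as a stand-in. The in-memory table and the helper name apply_formulas are assumptions made for illustration only; the formula text is spliced straight into the statement, just as in the T-SQL version, so it must come from a trusted source.

```python
import sqlite3

# Hypothetical stand-in for dbo.formulas: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE formulas (id INTEGER PRIMARY KEY, formula TEXT NOT NULL, "
    "a INT, b INT, c INT, d INT, e INT, result INT)"
)
conn.executemany(
    "INSERT INTO formulas (id, formula, a, b, c, d, e) VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "(a+b)/d", 1, 20, 2, 3, 1),
        (2, "(c+b)*(a+e)", 0, 1, 2, 3, 4),
        (3, "a*(d+e+c)", 7, 10, 6, 2, 1),
    ],
)


def apply_formulas(connection):
    """Run one dynamic UPDATE per distinct formula (illustrative helper)."""
    # Materialise the distinct formulas first so the loop does not read from
    # a cursor while updating the same table.
    formulas = [
        row[0] for row in connection.execute("SELECT DISTINCT formula FROM formulas")
    ]
    for formula in formulas:
        # The formula becomes part of the SQL text (dynamic SQL); only the
        # comparison value in the WHERE clause is passed as a parameter.
        connection.execute(
            f"UPDATE formulas SET result = {formula} WHERE formula = ?",
            (formula,),
        )
    connection.commit()


apply_formulas(conn)
results = dict(conn.execute("SELECT id, result FROM formulas ORDER BY id"))
print(results)
```

With integer columns, the division in the first formula is integer division in SQLite, as it is in T-SQL.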
If your database were an Azure Synapse Analytics dedicated SQL pool, you could look at Azure Synapse Notebooks to achieve the same outcome.

Related

I want to use SQL_VARIANT datatype in external table Azure SQL and I get the "Index was out of range error."

I have two SQL Azure databases - DatabaseA and DatabaseB on a server hosted in Azure.
I need to access a view on DatabaseA from DatabaseB - namely I need the sys.identity_columns in DatabaseA to be available to me on DatabaseB. So I am creating an external table on DatabaseB that links to this information like this (I didn't include all the columns but I included the one causing the problem)
CREATE EXTERNAL TABLE [SOURCE_SYS].[identity_columns](
[object_id] int not null
,[name] nvarchar(128) null
,[column_id] int not null
,[system_type_id] tinyint not null
,[seed_value] sql_variant null
)
WITH
(
DATA_SOURCE = MyElasticDBQueryDataSrc,
SCHEMA_NAME = 'sys',
OBJECT_NAME = 'identity_columns'
);
When I run this - it works. But when I try to use the result - select * from [SOURCE_SYS].[identity_columns] - I get this error:
Msg 46823, Level 16, State 1, Line 50
Error retrieving data from MyServer.database.windows.net.DatabaseA. The underlying error message received was: 'Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: index'.
If I comment out the fields in this table that have the sql_variant datatypes - it works fine but I do need the information in that field and the other two sql_variant fields that exist in the same table. MyElasticDBQueryDataSrc works fine on other similar tables without the sql_variant type.
Can anyone suggest what I might be doing wrong? Or suggest a workaround? I tried using bigints as it is mostly seed values that are either integers or null but that didn't work because it told me it wasn't the same datatype.
Any help much appreciated.
Well - after a weekend of sleep I figured out the answer!
If you use nvarchar(30) in the external table definition, you can then convert it to a bigint in any query you use it in:
CREATE EXTERNAL TABLE [SOURCE_SYS].[identity_columns](
[object_id] int not null
,[name] nvarchar(128) null
,[column_id] int not null
,[system_type_id] tinyint not null
,[seed_value] nvarchar(30) null
)
WITH
(
DATA_SOURCE = MyElasticDBQueryDataSrc,
SCHEMA_NAME = 'sys',
OBJECT_NAME = 'identity_columns'
);
Now I can access the value like this:
select cast(isnull([seed_value], 0) as bigint) from SOURCE_SYS.identity_columns
Beware that if you do a select * from the external table, you will need to handle the variant columns separately from the rest of the query; otherwise you'll get this error:
Msg 46825, Level 16, State 1, Line 58
The data type of the column 'seed_value' in the external table is different than the column's data type in the underlying standalone or sharded table present on the external source.
Hope this is helpful to someone!

virtual file set column and rowset variable U-SQL

I'm having an issue with scheduling a job in Data Factory.
I'm trying to set up an hourly scheduled job which executes the same script each hour with a different condition.
Consider that I have a bunch of Avro files spread across Azure Data Lake Store with the following pattern.
/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}{date:MM}{date:dd}__{date:H}
Each hour new files are added to Data Lake Store.
In order to process the files only once, I decided to handle them with the help of a U-SQL virtual file set column and a SyncTable which I created in Data Lake Store.
My query looks like the following.
DECLARE @file_set_path string = /Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}_{date:MM}_{date:dd}__{date:H};
@result = EXTRACT [Id] long,
....
date DateTime
FROM @file_set_path
USING someextractor;
@rdate =
SELECT MAX(ProcessedDate) AS ProcessedDate
FROM dbo.SyncTable
WHERE EntityName== "SomeEntity";
@finalResult = SELECT [Id],... FROM @result
CROSS JOIN @rdate AS r
WHERE date >= r.ProcessedDate;
Since I can't use a rowset variable in the WHERE clause, I'm cross joining the single row with the set. However, even in this case U-SQL won't find the correct files and always returns the whole file set.
Is there any workaround or other approach?
I think this approach should work unless there is something not quite right somewhere, i.e. can you confirm the datatypes of the dbo.SyncTable table? Dump out @rdate and make sure the value you get there is what you expect.
I put together a simple demo which worked as expected. My copy of SyncTable had one record with the value of 01/01/2018:
@working =
SELECT *
FROM (
VALUES
( (int)1, DateTime.Parse("2017/12/31") ),
( (int)2, DateTime.Parse("2018/01/01") ),
( (int)3, DateTime.Parse("2018/02/01") )
) AS x ( id, someDate );
@rdate =
SELECT MAX(ProcessedDate) AS maxDate
FROM dbo.SyncTable;
//@output =
// SELECT *
// FROM @rdate;
@output =
SELECT *, (w.someDate - r.maxDate).ToString() AS diff
FROM @working AS w
CROSS JOIN
@rdate AS r
WHERE w.someDate >= r.maxDate;
OUTPUT @output TO "/output/output.csv"
USING Outputters.Csv();
I did try this with a filepath (full script here). The thing to remember is the custom date format H represents the hour as a number from 0 to 23. If your SyncTable date does not have a time component to it when you insert it, it will default to midnight (0), meaning the whole day will be collected. Your file structure should look something like this according to your pattern:
"D:\Data Lake\USQLDataRoot\Data\SomeEntity\2017\12\31\SomeEntity_2017_12_31__8\test.csv"
I note your filepath has underscores in the second section and a double underscore before the hour section (which will be between 0 and 23, so a single digit for hours before 10). I notice your fileset path does not have a file type or quotes - I've used test.csv in my tests.
Basically I think the approach will work, but there is something not quite right, maybe in your file structure, the value in your SyncTable, the datatype etc. You need to go over the details, dump out intermediate values to check until you find the problem.
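One way to check what the pattern expands to is to build the expected path for a given timestamp outside U-SQL. The sketch below is only an illustration in Python; the helper name file_path_for is made up, and it assumes the underscore variant of the pattern with an unpadded hour, matching the {date:H} format described above.

```python
from datetime import datetime


def file_path_for(ts: datetime, entity: str = "SomeEntity") -> str:
    """Expand the file set pattern for one timestamp (illustrative helper)."""
    # yyyy, MM and dd are zero-padded; H is the hour 0-23 without padding.
    return (
        f"/Data/{entity}/{ts:%Y}/{ts:%m}/{ts:%d}/"
        f"{entity}_{ts:%Y}_{ts:%m}_{ts:%d}__{ts.hour}"
    )


morning = file_path_for(datetime(2017, 12, 31, 8))
midnight = file_path_for(datetime(2018, 1, 1))
print(morning)
print(midnight)
```

A sync date stored without a time component maps to hour 0, which is why every file for that day passes a >= comparison.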
Doesn't the gist of wBob's full script resolve your issue? Here is a very slightly edited version of wBob's full script to address some of the issues you raised:
Ability to filter on SyncTable,
last part of pattern is file name and not folder. Sample file and structure: \Data\SomeEntity\2018\01\01\SomeEntity_2018_01_01__1
DECLARE @file_set_path string = @"/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}_{date:MM}_{date:dd}__{date:H}";
@input =
EXTRACT [Id] long,
date DateTime
FROM @file_set_path
USING Extractors.Text();
// in lieu of creating actual table
@syncTable =
SELECT * FROM
( VALUES
( "SomeEntity", new DateTime(2018,01,01,01,00,00) ),
( "AnotherEntity", new DateTime(2018,01,01,01,00,00) ),
( "SomeEntity", new DateTime(2018,01,01,00,00,00) ),
( "AnotherEntity", new DateTime(2018,01,01,00,00,00) ),
( "SomeEntity", new DateTime(2017,12,31,23,00,00) ),
( "AnotherEntity", new DateTime(2017,12,31,23,00,00) )
) AS x ( EntityName, ProcessedDate );
@rdate =
SELECT MAX(ProcessedDate) AS maxDate
FROM @syncTable
WHERE EntityName== "SomeEntity";
@output =
SELECT *,
date.ToString() AS dateString
FROM @input AS i
CROSS JOIN
@rdate AS r
WHERE i.date >= r.maxDate;
OUTPUT @output
TO "/output/output.txt"
ORDER BY Id
USING Outputters.Text(quoting:false);
Also please note that file sets cannot perform partition elimination on dynamic joins, since the values are not known to the optimizer during the preparation phase.
I would suggest passing the sync point as a parameter from ADF to the processing script. Then the value is known to the optimizer and file set partition elimination will kick in. In the worst case, you would have to read the value from your sync table in a previous script and use it as a parameter in the next.

How do you insert a string or text as a blob in Cassandra (specifically CQLSH)?

I was trying to insert text or some string as a blob for testing purposes in CQLSH
insert into test_by_score (commit, delta, test, score)
values (textAsBlob('bdb14fbe076f6b94444c660e36a400151f26fc6f'), 0,
textAsBlob('{"prefix": "enwiki", "title": "\"Aghnadarragh\""}'), 100);
It didn't really work, because after I did:
select * from test_by_score where commit = 0x0b5db8b91bfdeb0a304b372dd8dda123b3fd1ab6;
it said there were 0 columns... which was a little unexpected (because it didn't throw an error at me), but I guess textAsBlob is not a thing in cqlsh. Does someone know how to do this?
Schema:
CREATE TABLE IF NOT EXISTS test_by_score (
commit blob,
delta int,
score int,
test blob,
PRIMARY KEY(commit, delta, test)
);
I have posted my schema a little reluctantly because I believe my question is not really about this specific schema. What I simply want to know is: if there is one column that holds blobs, is it possible to insert a string in that position by first changing it to a blob and then inserting it in cqlsh?
The following seems to be working. Your WHERE condition in your SELECT statement may be trying to access the incorrect hex 0x0b5db8b91bfdeb0a304b372dd8dda123b3fd1ab6.
DROP KEYSPACE IF EXISTS example;
CREATE KEYSPACE example WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};
USE example;
CREATE TABLE IF NOT EXISTS test_by_score (
commit blob, -- blob representing the commit hash
delta int, -- how much the scores have changed
score int, -- the test score, which is determined by the client
test blob, -- blob for the test
PRIMARY KEY(commit, delta, test)
);
INSERT INTO test_by_score (commit, delta, test, score) VALUES
(textAsBlob('bdb14fbe076f6b94444c660e36a400151f26fc6f'), 0, textAsBlob('{"prefix": "enwiki", "title": "\"Aghnadarragh\""}'), 100
);
INSERT INTO test_by_score (commit, delta, test, score) VALUES (
textAsBlob('cdb14fbe076f6b94444c660e36a400151f26fc6f'), 0, textAsBlob('{"prefix": "enwiki", "title": "\"Aghnadarragh\""}'), 100
);
INSERT INTO test_by_score (commit, delta, test, score) VALUES (
textAsBlob('adb14fbe076f6b94444c660e36a400151f26fc6f'), 0, textAsBlob('{"prefix": "enwiki", "title": "\"Aghnadarragh\""}'), 100
);
To grab all the rows
SELECT * FROM example.test_by_score;
To select a specific row.
SELECT * FROM example.test_by_score
WHERE commit = 0x62646231346662653037366636623934343434633636306533366134303031353166323666633666;
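The long hex literal in that WHERE clause is the UTF-8 encoding of the commit string itself, because textAsBlob stores the bytes of the text rather than interpreting it as hex digits. A small Python sketch (the helper name text_as_blob_literal is made up for illustration) shows how to derive the literal for any string:

```python
def text_as_blob_literal(text: str) -> str:
    """Return the CQL blob constant for the UTF-8 bytes of text (illustrative)."""
    return "0x" + text.encode("utf-8").hex()


commit = "bdb14fbe076f6b94444c660e36a400151f26fc6f"
literal = text_as_blob_literal(commit)
print(literal)
```

If the intent is instead to store the raw 20 bytes of the hash, the value can be written directly as a blob constant (0xbdb14fbe... without textAsBlob), in which case the same constant is what the WHERE clause has to match.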

Oracle spatial data operator - SDO_nn - Not getting any results for sdo_num_res = 1

I am using SDO_NN operator to find the nearest hydrant next to a building.
Building:
CREATE TABLE "BUILDINGS"
(
"NAME" VARCHAR2(40),
"SHAPE" "SDO_GEOMETRY")
Hydrant:
CREATE TABLE "HYDRANTS"
( "NAME" VARCHAR2(10),
"POINT" "SDO_POINT_TYPE"
);
I have set up spatial indexes properly for buildings.shape and I run the query to get the nearest hydrant to the building 'Motel':
select b1.name as name, h.point.x as x, h.point.y as y from buildings b1, hydrants h where b1.name ='Motel' and
SDO_nn( b1.shape, MDSYS.SDO_GEOMETRY(2003,NULL, NULL,SDO_ELEM_INFO_ARRAY(1,1003,1),
SDO_ORDINATE_ARRAY( h.point.x,h.point.y)), 'sdo_num_res=1')= 'TRUE';
Here's the problem:
When I set the parameter sdo_num_res=1, I get zero tuples.
And when I make sdo_num_res=2, I get one tuple.
What is the reason for this weird behavior?
Note: I am getting zero rows only when building.name= 'Motel', for all other tuples I am getting 1 row when sdo_num_res = 1
Edit:
Insert queries
Insert into buildings (NAME,SHAPE) values ('Motel',MDSYS.SDO_GEOMETRY(2003,NULL,NULL,MDSYS.SDO_ELEM_INFO_ARRAY(1,1003,1),MDSYS.SDO_ORDINATE_ARRAY(564,425,585,436,573,458,552,447)));
Insert into hydrants (name,POINT) values ('p57',MDSYS.SDO_POINT_TYPE(589,448,0));
To perform spatial comparisons between a point and a polygon, the point is defined as an SDO_GEOMETRY with SDO_GTYPE=2001 and its SDO_POINT attribute set to the SDO_POINT_TYPE we want to compare:
MDSYS.SDO_GEOMETRY(2001, NULL, SDO_POINT_TYPE(-79, 37, NULL), NULL, NULL)
First of all, your query does not do what you say it does: it actually returns the nearest building called "Motel" from any of your hydrants. To do what you want (i.e. the opposite) you need to reverse the order of the arguments to SDO_NN: all spatial operators search the first argument, using the value of the second argument.
Then the insert into your HYDRANTS table is wrong:
Insert into hydrants (name,POINT) values ('p57',MDSYS.SDO_POINT_TYPE(589,448,0));
The SDO_POINT_TYPE object is not designed to be used that way: it is only used inside the SDO_GEOMETRY type. The proper way is this:
insert into hydrants (name,POINT) values ('p57',sdo_geometry(2001, null, SDO_POINT_TYPE(589,448,null), null, null));
And of course you need to change your table definition accordingly.
Then your building is also incorrectly created: a polygon must always close, i.e. the last point must be the same as the first point. So the proper shape should be like this:
insert into buildings (NAME,SHAPE) values ('Motel', SDO_GEOMETRY(2003,NULL,NULL,SDO_ELEM_INFO_ARRAY(1,1003,1),SDO_ORDINATE_ARRAY(564,425,585,436,573,458,552,447,564,425)));
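If the ordinate lists are generated by a script before being loaded, the closing point can be appended automatically. This is only a sketch in Python with a made-up helper name (close_ring); it assumes a flat list of 2D x,y ordinates like the SDO_ORDINATE_ARRAY above.

```python
def close_ring(ordinates):
    """Return the ordinates with the first point repeated at the end if needed.

    Illustrative helper: expects a flat sequence x1, y1, x2, y2, ... in 2D.
    """
    if len(ordinates) < 6 or len(ordinates) % 2 != 0:
        raise ValueError("expected at least three x,y pairs")
    first = list(ordinates[:2])
    last = list(ordinates[-2:])
    if first == last:
        # Already closed: return an unchanged copy.
        return list(ordinates)
    return list(ordinates) + first


motel = [564, 425, 585, 436, 573, 458, 552, 447]
closed = close_ring(motel)
print(closed)
```

Running the helper on an already closed ring returns it unchanged, so it is safe to apply to every shape.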
Here is the full example:
Create the tables:
create table buildings (
name varchar2(40) primary key,
shape sdo_geometry
);
create table hydrants(
name varchar2(10) primary key,
point sdo_geometry
);
Populate the tables:
insert into buildings (NAME,SHAPE) values ('Motel', SDO_GEOMETRY(2003,NULL,NULL,SDO_ELEM_INFO_ARRAY(1,1003,1),SDO_ORDINATE_ARRAY(564,425,585,436,573,458,552,447,564,425)));
insert into hydrants (name,POINT) values ('p57',sdo_geometry(2001, null, SDO_POINT_TYPE(589,448,null), null, null));
commit;
Confirm that the geometries are all correct:
select name, sdo_geom.validate_geometry_with_context (point, 0.05) from hydrants;
select name, sdo_geom.validate_geometry_with_context (shape, 0.05) from buildings;
Set up spatial metadata and create spatial indexes:
insert into user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
values (
'BUILDINGS',
'SHAPE',
sdo_dim_array (
sdo_dim_element ('X', 0,1000,0.05),
sdo_dim_element ('Y', 0,1000,0.05)
),
null
);
commit;
create index buildings_sx on buildings (shape)
indextype is mdsys.spatial_index;
insert into user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
values (
'HYDRANTS',
'POINT',
sdo_dim_array (
sdo_dim_element ('X', 0,1000,0.05),
sdo_dim_element ('Y', 0,1000,0.05)
),
null
);
commit;
create index hydrants_sx on hydrants (point)
indextype is mdsys.spatial_index;
Now try the properly written query:
select h.name, h.point.sdo_point.x as x, h.point.sdo_point.y as y
from buildings b, hydrants h
where b.name ='Motel'
and sdo_nn(h.point, b.shape, 'sdo_num_res=1')= 'TRUE';
which returns:
NAME X Y
---------------- ---------- ----------
p57 589 448
1 row selected.

Cannot link MS Access query with subquery

I have created a query with a subquery in Access, and cannot link it in Excel 2003: when I use the menu Data -> Import External Data -> Import Data... and select the mdb file, the query is not present in the list. If I use the menu Data -> Import External Data -> New Database Query..., I can see my query in the list, but at the end of the import wizard I get this error:
Too few parameters. Expected 2.
My guess is that the query syntax is causing the problem, in fact the query contains a subquery. So, I'll try to describe the query goal and the resulting syntax.
Table Positions
ID (Autonumber, Primary Key)
position (double)
currency_id (long) (references Currency.ID)
portfolio (long)
Table Currency
ID (Autonumber, Primary Key)
code (text)
Query Goal
Join the 2 tables
Filter by portfolio = 1
Filter by currency.code in ("A", "B")
Group by currency and calculate the sum of the positions for each currency group and call the result: sumOfPositions
Calculate abs(sumOfPositions) on each currency group
Calculate the sum of the previous results as a single result
Query
The query without the final sum can be created using the Design View. The resulting SQL is:
SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")));
in order to calculate the final SUM I did the following (in the SQL View):
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
So, the question is: is there a better way for structuring the query in order to make the export work?
I can't see too much wrong with it, but I would take out some of the junk Access puts in and scale down the query to this, hopefully this should run ok:
SELECT Sum(Abs(A.SumOfPosition)) As SumAbs
FROM (SELECT C.code, Sum(P.position) AS SumOfposition
FROM Currency As C INNER JOIN Positions As P ON C.ID = P.currency_id
WHERE P.portfolio=1
GROUP BY C.code
HAVING C.code In ("A","B")) As A
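To sanity-check the logic of the simplified query outside Access, it can be run against a throwaway SQLite database from Python. The sample rows below are invented purely for illustration, and the string literals use single quotes because that is what SQLite expects; this checks only the query logic, not the Excel import.

```python
import sqlite3

# Throwaway in-memory database with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Currency (ID INTEGER PRIMARY KEY, code TEXT);
    CREATE TABLE Positions (
        ID INTEGER PRIMARY KEY,
        position REAL,
        currency_id INTEGER,
        portfolio INTEGER
    );
    INSERT INTO Currency (ID, code) VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO Positions (position, currency_id, portfolio) VALUES
        (10, 1, 1), (-30, 1, 1), (5, 2, 1), (100, 3, 1), (7, 1, 2);
    """
)

query = """
SELECT SUM(ABS(A.SumOfPosition)) AS SumAbs
FROM (SELECT C.code, SUM(P.position) AS SumOfPosition
      FROM Currency AS C INNER JOIN Positions AS P ON C.ID = P.currency_id
      WHERE P.portfolio = 1
      GROUP BY C.code
      HAVING C.code IN ('A', 'B')) AS A
"""

# Portfolio 1 sums: A = 10 - 30 = -20, B = 5; C is filtered out by HAVING.
sum_abs = conn.execute(query).fetchone()[0]
print(sum_abs)
```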
It might be worth trying to declare your parameters in the MS Access query definition and define their datatypes. This is especially important when you are trying to use the query outside of MS Access itself, since it can't auto-detect the parameter types. This approach is sometimes hit or miss, but worth a shot.
PARAMETERS [[Positions].[portfolio]] Long, [[Currency].[code]] Text ( 255 );
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
I have solved my problems thanks to the fact that the outer query is doing a trivial sum. When choosing New Database Query... in Excel, at the end of the process, after pressing Finish, an Import Data form pops up, asking
Where do you want to put the data?
you can click on Create a PivotTable report... . If you define the PivotTable properly, Excel will display only the outer sum.

Resources