I have data lying in multiple files with a naming convention of {year}/{month}/{date}, which contain duplicates (a daily delta where records may get updated every day).
I want to create a view that returns the records with the duplicates merged / squashed.
The duplicates will be ranked, and only the latest updated record for each primary key will be returned.
But the use of rowsets in a view does not seem to be supported. Basically I want something like this:
CREATE VIEW viewname AS
@sourceData = EXTRACT //schema
FROM //filenamePattern (regex)
USING Extractors.Tsv();
@rankedData = SELECT *,
ROW_NUMBER() OVER(PARTITION BY primary_Key ORDER BY timestamp DESC) AS RowNumber FROM @sourceData;
SELECT //schema
FROM @rankedData WHERE RowNumber == 1;
so that when I do
SELECT * FROM viewname
I get the merged data directly from the underlying files. How can I achieve this?
It is possible to have multiple EXTRACT statements in a view stacked together with a UNION statement, which implicitly removes duplicates. However, is there any particular reason you need to use a view? A view will limit your options, as you will have to code within its limitations (e.g. views can't be parameterised). You could instead use a table-valued function, a stored procedure or just a plain old script. These would give you many more options, especially if your de-duplication logic is complex. A simple example:
DROP VIEW IF EXISTS vw_removeDupes;
CREATE VIEW vw_removeDupes
AS
EXTRACT someVal int
FROM "/input/input59a.txt"
USING Extractors.Tsv()
UNION
EXTRACT someVal int
FROM "/input/input59b.txt"
USING Extractors.Tsv();
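You can then consume the view from any script like an ordinary rowset, e.g. (the output path is illustrative):
@result = SELECT * FROM vw_removeDupes;
OUTPUT @result TO "/output/deduped.tsv" USING Outputters.Tsv();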
I think it can be solved with a table-valued function. Have you tried using one?
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/u-sql-functions
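As a minimal sketch of what that could look like for the de-duplication case, assuming illustrative column names (primary_Key, payload, updated) and a /data/{year}/{month}/{day}/data.tsv layout (FileDate is a virtual column extracted from the file path):

DROP FUNCTION IF EXISTS dbo.GetLatestRecords;
CREATE FUNCTION dbo.GetLatestRecords()
RETURNS @result TABLE(primary_Key string, payload string, updated DateTime)
AS
BEGIN
    @source =
        EXTRACT primary_Key string,
                payload string,
                updated DateTime,
                FileDate DateTime
        FROM "/data/{FileDate:yyyy}/{FileDate:MM}/{FileDate:dd}/data.tsv"
        USING Extractors.Tsv();

    // rank each key's rows, newest first
    @ranked =
        SELECT primary_Key, payload, updated,
               ROW_NUMBER() OVER(PARTITION BY primary_Key ORDER BY updated DESC) AS RowNumber
        FROM @source;

    // keep only the latest version of each record
    @result =
        SELECT primary_Key, payload, updated
        FROM @ranked
        WHERE RowNumber == 1;
END;

A script can then call @latest = dbo.GetLatestRecords(); exactly as if it were the view from the question.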
The recommended way of doing an upsert in a delta table is the following.
MERGE INTO users
USING updates
ON users.userId = updates.userId
WHEN MATCHED THEN
UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
INSERT (userId, address) VALUES (updates.userId, updates.address)
Here updates is a table. My question is: how can we do an upsert directly, that is, without using a source table? I would like to give the values myself directly.
In SQLite, we could simply do the following.
REPLACE INTO table(column_list)
VALUES(value_list);
Is there a simple way to do that for Delta tables?
A source table can be a subquery, so the following should give you what you're after.
MERGE INTO events
USING (VALUES(...)) -- round brackets are required to denote a subquery
ON false -- an artificial merge condition
WHEN NOT MATCHED THEN INSERT *
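Applied to the users table from the question, a concrete sketch could look like the following (the values and the inline-table alias are illustrative; Spark SQL accepts VALUES with a column-list alias in a FROM clause):

MERGE INTO users
USING (SELECT * FROM VALUES (1, '123 Main St') AS v(userId, address)) AS updates
ON users.userId = updates.userId
WHEN MATCHED THEN
  UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
  INSERT (userId, address) VALUES (updates.userId, updates.address)

Keeping the real key match in the ON clause gives true upsert semantics; the ON false trick above is for the pure-insert case.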
I have a question regarding Power Query and tables as parameters in Excel.
Right now I can create a table and use it as a parameter for Power Query via drill down.
But I'm unsure how I would proceed with a table that has multiple values. How can a table with multiple "values" be recognized as a parameter?
For example:
I have the following raw data and parameter tables:
[Screenshot: raw data and parameter tables]
Now if I wanted to filter on Value2 with a parameter table, I would do a drill down of the parameter table and load it into Excel.
After that I have two tables, and I can filter Value2 with an OR condition by 1 and 2.
Is it possible to somehow combine this into one table so that it still uses an OR condition to search Value2?
I'm asking because I want it to be possible to just add more and more parameters to the table without creating a new table every time: basically, just copy-paste some parameters into the parameter table and be done with it.
Thanks for any help in advance
Assuming you use parameters only for filtering: there are other ways, but this one looks the best from a performance point of view.
You may create a Parameters table, so that you have tables like these:
Note, it's handy to have the same name (Value2) for the key column in both tables; otherwise Table.Join will create additional column(s) after merging the tables.
Add a similar step to filter the RawData table:
join = Table.Join(RawData, "Value2", Parameters, "Value2")
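As a fuller sketch, assuming both tables are loaded into the workbook as Excel tables named RawData and Parameters (the names are assumptions), the whole query could look like this:

let
    // load both Excel tables by name
    RawData    = Excel.CurrentWorkbook(){[Name = "RawData"]}[Content],
    Parameters = Excel.CurrentWorkbook(){[Name = "Parameters"]}[Content],
    // the inner join keeps exactly the rows whose Value2 appears in the
    // parameter table, i.e. an OR over all parameter values
    Joined     = Table.Join(RawData, "Value2", Parameters, "Value2")
in
    Joined

Adding more rows to the Parameters table then widens the filter without touching the query.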
I have a question about querying a Cassandra collection.
I want to write a query that searches a collection.
CREATE TABLE rd_db.test1 (
testcol3 frozen<set<text>> PRIMARY KEY,
testcol1 text,
testcol2 int
)
This is the table structure,
and this is the table contents.
In this situation, I want to write a CQL query with alternative option values on the set column.
If it were SQL and testcol3 weren't a collection, it would be:
select * from rd_db.test1 where testcol3 = 4 or testcol3 = 5
But it is CQL and a collection, so I tried:
select * from test1 where testcol3 contains '4' OR testcol3 contains '5' ALLOW FILTERING;
select * from test1 where testcol3 IN ('4','5') ALLOW FILTERING;
but neither of these queries works...
Please help...
This won't work for you, for multiple reasons:
- there is no OR operation in CQL;
- you can only do a full match on the value of the partition key (testcol3);
- although you may create secondary indexes on fields with collection types, it's impossible to create an index on the values of a partition key.
You need to change the data model, and you need to know the queries that you're executing in advance. From a brief look at your data model, I would suggest rolling the set field out into multiple rows, with individual values corresponding to individual partitions, as sketched below.
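A minimal sketch of that remodeling, with assumed table and column names (each element of the original set becomes its own partition):

CREATE TABLE rd_db.test2 (
    testval text,       -- one element of the original set, now the partition key
    testcol1 text,
    testcol2 int,
    PRIMARY KEY (testval, testcol1)
);

-- IN is allowed on the partition key, which replaces the OR from the question:
SELECT * FROM rd_db.test2 WHERE testval IN ('4', '5');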
But I also want to suggest taking the DS201 & DS220 courses on the DataStax Academy site for a better understanding of how Cassandra works and how to model data for it.
I have a VBA script that generates a query string for a SAP HANA ODBC connection in Excel. The query is determined by user inputs and can vary greatly in length. The query itself consists of many versions of a similar query appended to one another with UNION ALL.
The script sometimes throws a runtime error when trying to refresh. From my research, it has become clear that the reason for this is that the CommandText string exceeds the maximum allowed length of 32,767 characters (https://ask.sqlservercentral.com/questions/50819/too-long-sql-in-excel-vba.html).
I wondered whether there is a workaround for this, other than using a stored procedure. (I am not against that if there is a way to create a stored procedure at runtime and then execute it, but I cannot use a predefined stored procedure, as my query is different every time; hence the need for VBA to create it.)
Some more info about the dynamic query in VBA:
Column names, as well as parameters, are created dynamically and can be different every time.
The query uses groups of lists of product numbers to generate an IN clause for each product group, then sums the sales for those products under the name of the group. These are then all UNIONed together to create one table of grouped records.
Example of user input: [screenshot]
Example of resulting query:
WITH SOME_CTE (SOME_FIELDS) AS
(SELECT SOME_STUFF
FROM SOME_TABLE
WHERE SOME_STUFF_IS_GOING_ON)
SELECT GEND "Gender", 'Attribute 1' "Attribute", SUM(UNITS) "Units", SUM(VAL) "Value", SUM(MARGIN) "Margin"
FROM SOME_CTE
WHERE PRODUCT IN ('12345', '23456', '34567', '45678')
GROUP BY GEND
UNION ALL
SELECT GEND, 'Attribute 2' ATTR_NAME, SUM(UNITS), SUM(VAL), SUM(MARGIN)
FROM SOME_CTE
WHERE PRODUCT IN ('01234', '02345', '03456', '03567')
GROUP BY GEND
ORDER BY "Gender", "Attribute"
...and so on.
As you can see, with 2 attribute groups containing 4 products each there is no problem, but when we get to around 30 groups with several hundred products each, the string can be too long.
Note: I have tried things like shortening field references in the repeated parts of the query string to 1 character etc., which helps but does not solve the problem.
Any help would be greatly appreciated.
One workaround is to send multiple queries. Since you are using UNION ALL, you can execute each single SELECT statement separately:
Create a table in (for example) the master database (don't create temporary tables, as they will be dropped after every query!), but before that make sure to delete the old one if it exists (and also drop the table after you are done with it). Then change every single SELECT statement into an INSERT statement, which will insert the records into your so-called temporary table.
This way you'll avoid lengthy queries; you'll just send single INSERT INTO ... SELECT statements.
At the end, to get all the results, you just need a simple SELECT query, as sketched below. After getting the data, you should drop the table, as it's no longer needed.
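A minimal sketch of the pattern, with illustrative table and column names; each statement is sent as its own (short) CommandText, and the CTE from the question would have to be repeated inside each INSERT, since a CTE cannot span statements:

-- recreate the staging table on every refresh
DROP TABLE STAGING_RESULTS;
CREATE TABLE STAGING_RESULTS (GENDER NVARCHAR(20), ATTRIBUTE NVARCHAR(50),
    UNITS INT, VAL DECIMAL(15,2), MARGIN DECIMAL(15,2));

-- one short INSERT per attribute group instead of one giant UNION ALL
INSERT INTO STAGING_RESULTS
SELECT GEND, 'Attribute 1', SUM(UNITS), SUM(VAL), SUM(MARGIN)
FROM SOME_TABLE
WHERE PRODUCT IN ('12345', '23456', '34567', '45678')
GROUP BY GEND;

-- ...repeat for each attribute group, then read everything back at once
SELECT * FROM STAGING_RESULTS ORDER BY GENDER, ATTRIBUTE;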
Maybe I'm missing something simple, but is there a way to write a nested query in AX? I tried some syntax I thought would work, but with no luck.
The following standard SQL statement would accomplish what I'm trying to do, but I need to do this in AX, not SQL.
SELECT table1.column1A, table1.column1B,
(SELECT Top 1 column2B FROM table2
WHERE table1.column1A = table2.column2A
ORDER BY table2.column2B)
AS lookupResult
FROM table1
My problem is that table1 has a one-to-many relationship with table2, and since AX doesn't have a DISTINCT function that I'm aware of, I receive many copies of each record when using a JOIN statement.
Thanks
Nested queries are not supported in AX.
One way to work around the missing DISTINCT is to use group by (assuming the max value of column2B is the interesting one):
while select column1A, column1B from table1
    group by column1A, column1B
    join maxof(column2B) from table2
        where table2.column2A == table1.column1A
{
    ...
}
Another method would be to use a display method on table1 in the form or report.
display ColumnB column2B()
{
    return (select maxof(column2B) from table2
        where table2.column2A == this.column1A).column2B;
}
The performance is inferior to the first solution, but it may be acceptable.
As mentioned in the previous reply, group by is the closest you can get to a distinct function. If you need a simpler query for some reason, or if you need a table or query object to use as a data source on a form or report, you may entertain the idea of creating a view in the AOT which contains the group by. You can then use that view to easily join on a query object, form data source, etc.
AX 2012 has support for computed columns in views; you can use the SysComputedColumn class to build the query you want, as sketched below.
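A minimal sketch of that pattern, with all names assumed: the static method below would back a computed column on a view named MyView that has Table2 as a data source, and returns the T-SQL fragment the column evaluates to.

// Static method on the view backing a computed column (all names assumed).
private static server str compColumn2B()
{
    return SysComputedColumn::returnField(
        tableStr(MyView),             // the view itself
        identifierStr(Table2),        // the view's data source
        fieldStr(Table2, column2B));  // the field to surface
}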