I'm attempting to perform some sort of upsert operation in U-SQL where I pull data every day from a file, and compare it with yesterdays data which is stored in a table in Data Lake Storage.
I have created an ID column in the table in DL using row_number(), and it is this "counter" I wish to continue when appending new rows to the old dataset. E.g.
Last inserted row in DL table could look like this:
ID | Column1 | Column2
---+------------+---------
10 | SomeValue | 1
I want the next rows to have the following ascending ids
11 | SomeValue | 1
12 | SomeValue | 1
How would I go about making sure that the next X rows continues the ID count incrementally such that the next rows each increases the ID column by 1 more than the last?
You could use ROW_NUMBER then add it to the the max value from the original table (ie using CROSS JOIN and MAX). A simple demo of the technique:
DECLARE #outputFile string = #"\output\output.csv";
#originalInput =
SELECT *
FROM ( VALUES
( 10, "SomeValue 1", 1 )
) AS x ( id, column1, column2 );
#newInput =
SELECT *
FROM ( VALUES
( "SomeValue 2", 2 ),
( "SomeValue 3", 3 )
) AS x ( column1, column2 );
#output =
SELECT id, column1, column2
FROM #originalInput
UNION ALL
SELECT (int)(x.id + ROW_NUMBER() OVER()) AS id, column1, column2
FROM #newInput
CROSS JOIN ( SELECT MAX(id) AS id FROM #originalInput ) AS x;
OUTPUT #output
TO #outputFile
USING Outputters.Csv(outputHeader:true);
My results:
You will have to be careful if the original table is empty and add some additional conditions / null checks but I'll leave that up to you.
Related
Assuming I have the following data table:
Period | ID
-----------
P1 | 1
P2 | 1
P1 | 2
P2 | 3
P1 | 2
I am intersted in the number of unique IDs / Period only if the ID has not been counted already in a pervious period, ordered alphabatically. IDs per period in the source themselves can already occure multiple times and shall count as 1 / peroid (distinct count).
Also the data source is not pre-ordered by period and I have no influence on the sort order.
So the result I would like to get in a Pivot is like:
Period | Number of Unique IDs not already counted
-------------------------------------------------
P1 | 2 # Because the are uniquelly ID 1 and 2 in the period
P2 | 1 # Only counting ID 3, because ID 1 has already been counted in period 1
Please help me with the DAX measure I can use in the Pivot.
This is a measure written in DAX. It should work in a pivot table with the Period selected on the rows
DistinctID =
VAR PeriodsPerId =
SELECTCOLUMNS (
ALL ( T[ID] ),
"ID", T[ID],
"Period", CALCULATE ( MIN ( T[Period] ), ALLEXCEPT ( T, T[ID] ) )
)
RETURN
COUNTROWS ( FILTER ( PeriodsPerId, [Period] IN VALUES ( T[Period] ) ) )
It works first by preparing a table variable containing the minimum period per ID and then filtering this table for the Periods in the current selection.
Of course, if the Period is selected through a dimension, substitute the dimension in the last VALUES
Here's one way, which would require to reposition your columns as well as add a new column. This assumes you don't have duplicates in ID/Period combos. You didn't list any duplicates in your sample, so I'm making this assumption.
In my data, I have ID as column A and Period as column B.
Order your data by Period, ascending. Then in column C, you can use this formula to determine if that ID has been used before.
Cell C2 formula: =IF(VLOOKUP(A2,A:B,2,FALSE) = B2,1,0)
Copy it down and then create your pivot table, summing column C.
Table12
CustomerId CampaignID
1 1
1 2
2 3
1 3
4 2
4 4
5 5
val CustomerToCampaign = ((1,1),(1,2),(2,3),(1,3),(4,2),(4,4),(5,5))
Is it possible to write a query like
select CustomerId, CampaignID from Table12 where (CustomerId, CampaignID) in (CustomerToCampaign_1, CustomerToCampaign_2)
???
So the input is a tuple but the columns are not tuple but rather individual columns.
Sure, it's possible. But only on the clustering keys. That means I need to use something else as a partition key or "bucket." For this example, I'll assume that marketing campaigns are time sensitive and that we'll get a good distribution and easy of querying by using "month" as the bucket (partition).
CREATE TABLE stackoverflow.customertocampaign (
campaign_month int,
customer_id int,
campaign_id int,
customer_name text,
PRIMARY KEY (campaign_month, customer_id, campaign_id)
);
Now, I can INSERT the data described in your CustomerToCampaign variable. Then, this query works:
aploetz#cqlsh:stackoverflow> SELECT campaign_month, customer_id, campaign_id
FROM customertocampaign WHERE campaign_month=202004
AND (customer_id,campaign_id) = (1,2);
campaign_month | customer_id | campaign_id
----------------+-------------+-------------
202004 | 1 | 2
(1 rows)
I wan't to select rows by the order of ASC and DESC but cassandra data orders are fixed .
I use ScyllaDB.
My imaginary scenario of problem :
I have a table :
CREATE TABLE tbl(A text , B text , C text , primary key(A,B,C))
After inserting datas my table is :
Now i want to select top 1 ( or x ) item of row ( A - B - 3 )
And after that select bottom 1 ( or x ) item of row ( A - B - 3 ).
C order is ASC and it's fixed !
now i try to select bottom 1 item :
SELECT * FROM tbl WHERE A='A' AND B='B' AND C > '3' LIMIT 1 ;
but selecting top of ( A-B-3 ) is my problem
SELECT * FROM tbl WHERE A='A' AND B='B' AND C < '3' ???
Is there any solution for selecting top of item in cassandra ?
I'm trying to return data as columns.
I've written this unpivot and pivot query:
`select StockItemCode, barcode, barcode2 from (select StockItemCode, col+cast(seq as varchar(20)) col, value from (
select
(select min(StockItemCode)
from RTLBarCode t2
where t.StockItemCode = t2.StockItemCode) StockItemCode,
cast(BarCode as varchar(20)) barcode,
row_number() over(partition by StockItemCode order by StockItemCode) seq
from RTLBarCode t) d unpivot(
value
for col in (barcode) ) unpiv) src pivot ( max(value) for col in (barcode, barcode2)) piv;`
But the problem is only the "Barcode2" field are returning a value (the barcode field returns a null when in fact there is a value.
SAMPLE DATA
I have a Table called RTLBarCode
It has a field called Barcode and a field called StockItemCode
For StockItemCode = 10 I have 2 rows with a Barcode value of 5014721112824 and 0000000019149.
Can anyone see where I am going wrong?
Many thanks
You are indexing your barcode in unpiv.
This results in col's-values barcode1 and barcode2.
But then you are pivoting on barcode instead of barcode1. No value is found and the aggregate returns null.
The correct statement would be:
select StockItemCode, barcode1, barcode2 from
(
select StockItemCode, col+cast(seq as varchar(20)) col, value
from
(
select
(select min(StockItemCode)from RTLBarCode t2 where t.StockItemCode = t2.StockItemCode) StockItemCode,
cast(BarCode as varchar(20)) barcode,
row_number() over(partition by StockItemCode order by StockItemCode) seq
from RTLBarCode t
) d
unpivot(value for col in (barcode)) unpiv
) src
pivot (max(value) for col in (barcode1, barcode2)) piv
I have created a KEYSPACE and a TABLE with a uuid column as primary key and a timestamp column using an index. All this succeeded like the following picture showed:
cassandra#cqlsh:my_keyspace> insert into my_test ( id, insert_time, value ) values ( uuid(), '2015-03-12 09:10:30', '111' );
cassandra#cqlsh:my_keyspace> insert into my_test ( id, insert_time, value ) values ( uuid(), '2015-03-12 09:20:30', '222' );
cassandra#cqlsh:my_keyspace> select * from my_test;
id | insert_time | value
--------------------------------------+--------------------------+-------
9d7f88bc-5cb9-463f-b679-fd66e6469eb5 | 2015-03-12 09:20:30+0000 | 222
69579f6f-bf88-493b-a1d6-2f89fac25650 | 2015-03-12 09:10:30+0000 | 111
(2 rows)
and now query
cassandra#cqlsh:my_keyspace> select * from my_test where insert_time = '2015-03-12 09:20:30';
id | insert_time | value
--------------------------------------+--------------------------+-------
9d7f88bc-5cb9-463f-b679-fd66e6469eb5 | 2015-03-12 09:20:30+0000 | 222
(1 rows)
and now query with less than:
cassandra#cqlsh:my_keyspace> select * from my_test where insert_time < '2015-03-12 09:20:30';
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: 'insert_time < <value>'"
while the first query is successful, why this happened? How should I make the second query successful since that's just what I want?
You can test all this on your own machine. Thanks
CREATE TABLE my_test (
id uuid PRIMARY KEY,
insert_time timestamp,
value text
) ;
CREATE INDEX my_test_insert_time_idx ON my_keyspace.my_test (insert_time);
Cassandra range queries are quite limited. It goes down to performance, and data storage mechanics. A range query must have the following:
Hit a (or few with IN) partition key, and include exact matches on all consecutive clustering keys except the last one in the query, which you can do a range query on.
Say your PK is (a, b, c, d), then the following are allowed:
where a=a1 and b < b1
where a=a1 and b=b1 and c < c1
The following is not:
where a=a1 and c < 1
[I won't go into Allow Filtering here...avoid it.]
Secondary indexes must be exact matches. You can't have range queries on them.