RANK() OVER partition with reset depending on another column - Presto

I have a table with two different IDs and a timestamp, which I would like to rank. The peculiarity is that I want to rank the rows per S_ID only until there is an entry in O_ID. Once there is an entry in O_ID, the next rank for that S_ID should start again at 1.
Here is an example:
select
    S_ID,
    timestamp,
    O_ID,
    rank() OVER (PARTITION BY S_ID ORDER BY timestamp asc) AS RANK
from table
order by S_ID, timestamp;
S_ID     | Timestamp               | O_ID    | Rank
---------+-------------------------+---------+-----
2e114e9f | 2021-11-26 08:57:44.049 | NULL    |    1
2e114e9f | 2021-12-26 17:07:26.272 | NULL    |    2
2e114e9f | 2021-12-27 08:13:24.277 | NULL    |    3
2e114e9f | 2021-12-29 11:32:56.952 | 2287549 |    4
2e114e9f | 2021-12-30 13:41:28.821 | NULL    |    5
2e114e9f | 2021-12-30 19:53:28.590 | NULL    |    6
2e114e9f | 2022-02-05 09:50:54.104 | 2333002 |    7
2e114e9f | 2022-02-19 10:14:31.389 | NULL    |    8
How can I now add another rank that depends on the entries in the O_ID column?
The outcome should be:
S_ID     | Timestamp               | O_ID    | Rank S_ID | Rank both
---------+-------------------------+---------+-----------+----------
2e114e9f | 2021-11-26 08:57:44.049 | NULL    |         1 |         1
2e114e9f | 2021-12-26 17:07:26.272 | NULL    |         2 |         2
2e114e9f | 2021-12-27 08:13:24.277 | NULL    |         3 |         3
2e114e9f | 2021-12-29 11:32:56.952 | 2287549 |         4 |         4
2e114e9f | 2021-12-30 13:41:28.821 | NULL    |         5 |         1
2e114e9f | 2021-12-30 19:53:28.590 | NULL    |         6 |         2
2e114e9f | 2022-02-05 09:50:54.104 | 2333002 |         7 |         3
2e114e9f | 2022-02-19 10:14:31.389 | NULL    |         8 |         1
Any food for thought is welcome!

This looks like a gaps-and-islands problem - use lag to split the data into groups (based on equality of the current and previous O_ID, with some null handling) and then use the group value as an additional partition for the rank() function.
-- sample data
WITH dataset (S_ID, Timestamp, O_ID) AS (
    VALUES ('2e114e9f', timestamp '2021-11-26 08:57:44.049', NULL),
        ('2e114e9f', timestamp '2021-12-26 17:07:26.272', NULL),
        ('2e114e9f', timestamp '2021-12-27 08:13:24.277', NULL),
        ('2e114e9f', timestamp '2021-12-29 11:32:56.952', 2287549),
        ('2e114e9f', timestamp '2021-12-30 13:41:28.821', NULL),
        ('2e114e9f', timestamp '2021-12-30 19:53:28.590', NULL),
        ('2e114e9f', timestamp '2022-02-05 09:50:54.104', 2333002),
        ('2e114e9f', timestamp '2022-02-19 10:14:31.389', NULL)
)
-- query
select S_ID,
    Timestamp,
    O_ID,
    rank() OVER (PARTITION BY S_ID, grp ORDER BY timestamp asc) AS RANK
from (
    select *,
        sum(if(prev is not null and (O_ID is null or O_ID != prev), 1, 0))
            OVER (PARTITION BY S_ID ORDER BY timestamp asc) as grp
    from (
        select *,
            lag(O_ID) OVER (PARTITION BY S_ID ORDER BY timestamp asc) AS prev
        from dataset
    )
)
Output:
S_ID     | Timestamp               | O_ID    | RANK
---------+-------------------------+---------+-----
2e114e9f | 2021-11-26 08:57:44.049 | NULL    |    1
2e114e9f | 2021-12-26 17:07:26.272 | NULL    |    2
2e114e9f | 2021-12-27 08:13:24.277 | NULL    |    3
2e114e9f | 2021-12-29 11:32:56.952 | 2287549 |    4
2e114e9f | 2021-12-30 13:41:28.821 | NULL    |    1
2e114e9f | 2021-12-30 19:53:28.590 | NULL    |    2
2e114e9f | 2022-02-05 09:50:54.104 | 2333002 |    3
2e114e9f | 2022-02-19 10:14:31.389 | NULL    |    1

Related

How to build a null/notnull matrix in a pandas dataframe

Here's my dataset
Id Column_A Column_B Column_C
1 Null 7 Null
2 8 7 Null
3 Null 8 7
4 8 Null 8
Here's my expected output
Column_A Column_B Column_C Total
Null 2 1 2 5
Notnull 2 3 2 7
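For reference, the sample data used by the answers below can be built like this (a sketch that treats Null as NaN; the frame name df is just what the answers assume):
import numpy as np
import pandas as pd

# the question's sample data, with "Null" represented as NaN
df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Column_A": [np.nan, 8, np.nan, 8],
    "Column_B": [7, 7, 8, np.nan],
    "Column_C": [np.nan, np.nan, 7, 8],
})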
Assuming Null is NaN, here's one option: use isna + sum to count the NaNs, then subtract the NaN counts from the length of df to get the Notnull counts, and finally construct a DataFrame from the two Series.
nulls = df.drop(columns='Id').isna().sum()
notnulls = nulls.rsub(len(df))
out = pd.DataFrame.from_dict({'Null':nulls, 'Notnull':notnulls}, orient='index')
out['Total'] = out.sum(axis=1)
If you're into one liners, we could also do:
out = (df.drop(columns='Id').isna().sum().to_frame(name='Nulls')
.assign(Notnull=df.drop(columns='Id').notna().sum()).T
.assign(Total=lambda x: x.sum(axis=1)))
Output:
Column_A Column_B Column_C Total
Nulls 2 1 2 5
Notnull 2 3 2 7
Use Series.value_counts on the notna mask of the non-Id columns:
df = (df.replace('Null', np.nan)
        .set_index('Id')
        .notna()
        .apply(pd.Series.value_counts)
        .rename({True: 'Notnull', False: 'Null'}))
df['Total'] = df.sum(axis=1)
print(df)
Column_A Column_B Column_C Total
Null 2 1 2 5
Notnull 2 3 2 7

How to do null combination in pandas dataframe

Here's my dataset
Id Column_A Column_B Column_C
1 Null 7 Null
2 8 7 Null
3 Null 8 7
4 8 Null 8
If at least one column is null, Combination will be Null
Id Column_A Column_B Column_C Combination
1 Null 7 Null Null
2 8 7 Null Null
3 Null 8 7 Null
4 8 Null 8 Null
Assuming Null is NaN, we could use isna + any:
df['Combination'] = df.isna().any(axis=1).map({True: 'Null', False: 'Notnull'})
If Null is a string, we could use eq + any:
df['Combination'] = df.eq('Null').any(axis=1).map({True: 'Null', False: 'Notnull'})
Output:
Id Column_A Column_B Column_C Combination
0 1 Null 7 Null Null
1 2 8 7 Null Null
2 3 Null 8 7 Null
3 4 8 Null 8 Null
Use DataFrame.isna (or DataFrame.eq if Null is a string) with DataFrame.any and pass the result to numpy.where:
df['Combination'] = np.where(df.isna().any(axis=1), 'Null','Notnull')
df['Combination'] = np.where(df.eq('Null').any(axis=1), 'Null','Notnull')

Cassandra Partition key duplicates?

I am new to Cassandra, so I have a few quick questions. Suppose I do this:
CREATE TABLE my_keyspace.my_table (
id bigint,
year int,
datetime timestamp,
field1 int,
field2 int,
PRIMARY KEY ((id, year), datetime))
I imagine Cassandra as something like Map<PartitionKey, SortedMap<ColKey, ColVal>>,
My question is about querying Cassandra with a WHERE clause, for example:
SELECT * FROM my_keyspace.my_table WHERE id = 1 AND year = 4;
This could return 2 or more records. How does this fit in with the data model of Cassandra?
If it really is a big HashMap, how come duplicate records for a partition key are allowed?
Thanks!
Each row is stored as a batch of entries in the SortedMap<ColKey, ColVal>, relying on its sorted nature.
To build on your mental model: while there is only one partition key for id = 1 AND year = 4, there are multiple cells under it:
(id, year) | ColKey | ColVal
------------------------------------------
1, 4 | datetime(1):field1 | 1 \ Row1
1, 4 | datetime(1):field2 | 2 /
1, 4 | datetime(5):field1 | 1 \
1, 4 | datetime(5):field2 | 2 / Row2
...
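To make that mental model concrete, here is a toy sketch in Python (purely illustrative; it mirrors the diagram above, not Cassandra's actual storage format):
from collections import OrderedDict

# Map<PartitionKey, SortedMap<ClusteringKey, Columns>> -- a toy model of the
# mental picture above, not how Cassandra stores data on disk
table = {
    # partition key (id, year)
    (1, 4): OrderedDict([
        # clustering key (datetime) -> the row's regular columns
        ("datetime(1)", {"field1": 1, "field2": 2}),  # Row1
        ("datetime(5)", {"field1": 1, "field2": 2}),  # Row2
    ]),
}

# SELECT * ... WHERE id = 1 AND year = 4 then means: fetch one partition
# and return every clustered row inside it
for clustering_key, row in table[(1, 4)].items():
    print(clustering_key, row)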

Spark find NULL value blocks in Series of values

Assume this is my data:
date value
2016-01-01 1
2016-01-02 NULL
2016-01-03 NULL
2016-01-04 2
2016-01-05 3
2016-01-06 NULL
2016-01-07 NULL
2016-01-08 NULL
2016-01-09 1
I am trying to find the start and end dates that surround the NULL-value groups. An example output would be as follows:
start end
2016-01-01 2016-01-04
2016-01-05 2016-01-09
My first attempt at the problem produced the following:
df.filter($"value".isNull)
  .agg(to_date(date_add(max("date"), 1)) as "max",
       to_date(date_sub(min("date"), 1)) as "min")
but this only finds the total min and max value. I thought of using groupBy but don't know how to create a column for each of the null value blocks.
The tricky part is getting the borders of the groups, so you need several steps:
first, build groups of nulls/not-nulls (using window functions)
then group by block to get the borders within each block
then use window functions again to extend the borders
Here is a working example:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import ss.implicits._  // ss is the SparkSession
val df = Seq(
("2016-01-01", Some(1)),
("2016-01-02", None),
("2016-01-03", None),
("2016-01-04", Some(2)),
("2016-01-05", Some(3)),
("2016-01-06", None),
("2016-01-07", None),
("2016-01-08", None),
("2016-01-09", Some(1))
).toDF("date", "value")
df
  // build blocks
  .withColumn("isnull", when($"value".isNull, true).otherwise(false))
  .withColumn("lag_isnull", lag($"isnull", 1).over(Window.orderBy($"date")))
  .withColumn("change", coalesce($"isnull" =!= $"lag_isnull", lit(false)))
  .withColumn("block", sum($"change".cast("int")).over(Window.orderBy($"date")))
  // now calculate min/max within groups
  .groupBy($"block")
  .agg(
    min($"date").as("tmp_min"),
    max($"date").as("tmp_max"),
    (count($"value") === 0).as("null_block")
  )
  // now extend groups to include borders
  .withColumn("min", lag($"tmp_max", 1).over(Window.orderBy($"tmp_min")))
  .withColumn("max", lead($"tmp_min", 1).over(Window.orderBy($"tmp_max")))
  // only select null-groups
  .where($"null_block")
  .select($"min", $"max")
  .orderBy($"min")
  .show()
gives
+----------+----------+
| min| max|
+----------+----------+
|2016-01-01|2016-01-04|
|2016-01-05|2016-01-09|
+----------+----------+
I don't have a complete working solution, but I do have a few recommendations.
Look at using lag; you will also have to produce a corresponding lead column, as sketched below.
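A minimal sketch of adding those two columns with PySpark window functions (assuming the DataFrame is called df; df_2, lag_value and lead_value are the names used below):
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.orderBy("date")  # a single series; add partitionBy(...) if you have one

df_2 = (df
        .withColumn("lag_value", F.lag("value", 1).over(w))     # previous row's value
        .withColumn("lead_value", F.lead("value", 1).over(w)))  # next row's value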
Now assume you have your lag and lead columns. Your resulting dataframe will look like this:
date        value  lag_value  lead_value
2016-01-01  1      NULL       NULL
2016-01-02  NULL   1          NULL
2016-01-03  NULL   NULL       2
2016-01-04  2      NULL       3
2016-01-05  3      2          NULL
2016-01-06  NULL   3          NULL
2016-01-07  NULL   NULL       NULL
2016-01-08  NULL   NULL       1
2016-01-09  1      NULL       NULL
Now you just need to filter on the following conditions:
start of a null block (the non-null row just before it):
df_2.filter("value IS NOT NULL AND lead_value IS NULL")
end of a null block (the non-null row just after it):
df_2.filter("value IS NOT NULL AND lag_value IS NULL")
Note that the very first and last rows of the whole series also have NULL lag/lead values, so those boundary rows may need extra handling.
If you want to be a bit more advanced, you can also use a when expression to create a new column which states whether the date is a start or end date of a null group:
date        value  lag_value  lead_value  group_date_type
2016-01-01  1      NULL       NULL        start
2016-01-02  NULL   1          NULL        NULL
2016-01-03  NULL   NULL       2           NULL
2016-01-04  2      NULL       3           end
2016-01-05  3      2          NULL        start
2016-01-06  NULL   3          NULL        NULL
2016-01-07  NULL   NULL       NULL        NULL
2016-01-08  NULL   NULL       1           NULL
2016-01-09  1      NULL       NULL        end
This can be created with something that looks like this:
from pyspark.sql import functions as F

df_2.withColumn('group_date_type',
    F.when(F.expr("value IS NOT NULL AND lead_value IS NULL"), 'start')
     .when(F.expr("value IS NOT NULL AND lag_value IS NULL"), 'end')
     .otherwise(None)
)
# the first and last rows of the series also have NULL lag/lead values,
# so these boundary rows may need extra handling

Cassandra: how to initialize the counter column with value?

I have to benchmark Cassandra with Facebook's LinkBench. There are two phases during the benchmark, the load phase and the request phase.
In the load phase, LinkBench fills the Cassandra tables nodes, links and counts (for link counting) with default values (graph data).
The count table looks like this:
keyspace.counttable (
link_id bigint,
link_type bigint,
time bigint,
version bigint,
count counter,
PRIMARY KEY (link_id, link_type, time, version)
)
My question is: how do I insert the default counter values (before incrementing and decrementing the counter in the LinkBench request phase)?
If that isn't possible with Cassandra, how should I increment/decrement a bigint column instead of a counter column?
Any suggestions and comments? Thanks a lot.
The default value is zero. Given
create table counttable (
link_id bigint,
link_type bigint,
time bigint,
version bigint,
count counter,
PRIMARY KEY (link_id, link_type, time, version)
);
and
update counttable set count = count + 1 where link_id = 1 and link_type = 1 and time = 1 and version = 1;
We see that the value of count is now 1.
select * from counttable ;
link_id | link_type | time | version | count
---------+-----------+------+---------+-------
1 | 1 | 1 | 1 | 1
(1 rows)
So, if we want to set it to some other value we can:
update counttable set count = count + 500 where link_id = 1 and link_type = 1 and time = 1 and version = 2;
select * from counttable ;
link_id | link_type | time | version | count
---------+-----------+------+---------+-------
1 | 1 | 1 | 1 | 1
1 | 1 | 1 | 2 | 500
(2 rows)
There is no elegant way to initialize a counter column with a non-zero value. The only operations you can do on a counter field are increment and decrement. I recommend keeping the offset (i.e. your intended initial value) in a separate regular column; since Cassandra does not allow counter and non-counter columns in the same table, that means a small companion table. Then simply add the two values in your client application.
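A minimal client-side sketch of that idea, assuming a hypothetical companion table count_offsets(link_id, link_type, base) written during the load phase, a simplified primary key of (link_id, link_type), and the DataStax Python driver:
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# load phase: store the intended initial value once in a regular bigint column
session.execute(
    "INSERT INTO count_offsets (link_id, link_type, base) VALUES (%s, %s, %s)",
    (1, 1, 500))

# request phase: increment/decrement the counter as usual
session.execute(
    "UPDATE counttable SET count = count + 1 WHERE link_id = %s AND link_type = %s",
    (1, 1))

# read: add offset and counter together on the client side
base = session.execute(
    "SELECT base FROM count_offsets WHERE link_id = %s AND link_type = %s",
    (1, 1)).one()
ctr = session.execute(
    "SELECT count FROM counttable WHERE link_id = %s AND link_type = %s",
    (1, 1)).one()
effective = (base[0] if base else 0) + (ctr[0] if ctr else 0)  # e.g. 500 + 1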
Thank you for the answers. I implemented the following solution to initialize the counter field.
Since the initial and default value of a counter field is 0, I incremented it by my default value. It looks like Don Branson's solution, but with only one column:
create table counttable (
link_id bigint,
link_type bigint,
count counter,
PRIMARY KEY (link_id, link_type)
);
I set the value with this statement (during the load phase):
update counttable set count = count + myValue where link_id = 1 and link_type = 1;
select * from counttable ;
link_id | link_type | count
---------+-----------+--------
1 | 1 | myValue (added to 0)
(1 row)
