I have the following request handler:
let artPerCall = 20
let artPerUser = 2
let start = req.params.page
let query = `
    SELECT * FROM (
        SELECT a.*, row_to_json(u.*) AS userinfo,
               row_number() OVER (PARTITION BY u.address ORDER BY a.date DESC) AS ucount
        FROM artworks a INNER JOIN users u ON a.address = u.address
        WHERE a.flag != ($1) OR a.flag IS NULL
    ) t
    WHERE ucount <= ($2)
    ORDER BY date DESC
    LIMIT ${artPerCall} OFFSET ${(start - 1) * artPerCall}`
pool.query(query, ["ILLEGAL", artPerUser])
    .then(users => {
        if (users) {
            res.json(users.rows)
        }
    })
    .catch(err => {
        next(err)
    })
This is called through an API with a path like /artworks/paginate/1/20, i.e. /artworks/paginate/{page}/20.
The expected result is to get 20 results per call with a maximum of 2 entries per user.
The current result:
It seems to return only 2 entries per user as expected, but once 2 entries for a user have been returned on a page, no more results for that user appear on the following pages, even if they have more entries.
Any idea what I'm missing?
It seems to return only 2 entries per user as expected, but once 2 entries for a user have been returned on a page, no more results for that user appear on the following pages, even if they have more entries.
Correct, this is what the query does. It selects:
row_number() over (partition by u.address order by a.date desc) as ucount
...
WHERE ucount <= ($2)
If parameter $2 is set to 2 as it is in your example code, then it will select 2 entries per user, not more, before sorting and pagination. If the user has more entries they will be filtered out.
If you remove "WHERE ucount <= ($2)" then you'll simply get all the results ordered by date, but that doesn't sound like what you want.
However, what I think you want to achieve sounds a bit complicated. I'm also not sure it would be great for usability, as the results would look quite random to the user. So you will need to describe exactly what you want, with example data.
For example, if you want to avoid one user who posts a lot of items with the same date pushing all the other users down in the search results, limiting the number of results per user is a good idea, but perhaps a "more from this user..." button would be a better choice than pushing the user's items down to the next pages.
Suppose you have only two users: user1 posts 20 items with date "today" and user2 posted 10 items yesterday. Do you want 2 items from user1, then 2 items from user2, then the 18 remaining items from user1, then the 8 remaining items from user2? Or should they be interleaved with each other somewhat, which would make the date order a bit random in the results?
EDIT
Here's a proposal:
SELECT * FROM (
    SELECT *,
           row_number() OVER (PARTITION BY user_id ORDER BY date DESC) AS ucount
    FROM artworks
) t
ORDER BY (ucount / 3)::INTEGER ASC, date DESC
LIMIT 20 OFFSET 0;
"(ucount/3)::INTEGER" is 0 for the first two artworks of each user, then 1 for the next 3, then 2 for the next 3, etc. So the most recent 2 artworks of each user end up first, then the most recent 3 artworks of each user, etc.
Another one:
ORDER BY ucount < 3 DESC, date DESC
This will put the most recent 2 artworks of each user first (DESC because boolean false sorts before true in PostgreSQL), then the rest is simply sorted by date.
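For completeness, here's a rough sketch of how the first proposal could be plugged back into your original query, keeping the flag filter and the user join. This is untested and assumes your original schema; note the parameter numbering differs from your code:

SELECT * FROM (
    SELECT a.*, row_to_json(u.*) AS userinfo,
           row_number() OVER (PARTITION BY u.address ORDER BY a.date DESC) AS ucount
    FROM artworks a INNER JOIN users u ON a.address = u.address
    WHERE a.flag != ($1) OR a.flag IS NULL
) t
ORDER BY (ucount / 3)::INTEGER ASC, date DESC
LIMIT $2 OFFSET $3;
-- e.g. pool.query(query, ["ILLEGAL", artPerCall, (page - 1) * artPerCall])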
What is the correct behavior of the last and last_value functions in Apache Spark/Databricks SQL? The way I'm reading the documentation (here: https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html), it sounds like it should return the last value of whatever is in the expression.
So if I have a select statement that does something like
select
    person,
    last(team)
from
    (select * from person_team order by date_joined)
group by person
I should get the last team a person joined, yes/no?
The actual query I'm running is shown below. It is returning a different number each time I execute the query.
select count(distinct patient_id) from (
    select
        patient_id,
        org_patient_id,
        last_value(data_lot) data_lot
    from
        (select * from my_table order by data_lot)
    where 1=1
        and org = 'my_org'
    group by 1,2
    order by 1,2
)
where data_lot in ('2021-01','2021-02')
;
What is the correct way to get the last value for a given field (for either the team example or my specific example)?
--- EDIT -------------------
I'm thinking collect_set might be useful here, but I get the error shown when I try to run this:
select
    patient_id,
    last_value(collect_set(data_lot)) data_lot
from
    covid.demo
group by patient_id
;
Error in SQL statement: AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
Aggregate [patient_id#89338], [patient_id#89338, last_value(collect_set(data_lot#89342, 0, 0), false) AS data_lot#91848]
+- SubqueryAlias spark_catalog.covid.demo
The posts linked below discuss how to get max values, which is not the same as the last value in a list ordered by a different field. I want the last team a player joined: the player may have joined the Reds, the A's, the Zebras, and the Yankees, in that order timewise, and I'm looking for the Yankees. Those posts also get to the solution procedurally using Python/R; I'd like to do this in SQL.
Getting last value of group in Spark
Find maximum row per group in Spark DataFrame
--- SECOND EDIT -------------------
I ended up using something like this based upon the accepted answer.
select
    row_number() over (order by provided_date, data_lot) as row_num,
    demo.*
from demo
You can assign row numbers based on an ordering on data_lot if you want to get its last value:
select count(distinct patient_id) from (
    select * from (
        select *,
               row_number() over (partition by patient_id, org_patient_id, org order by data_lot desc) as rn
        from my_table
        where org = 'my_org'
    )
    where rn = 1
)
where data_lot in ('2021-01','2021-02');
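Applied to the person/team example from the first part of the question, the same row_number pattern would look something like this (a sketch, assuming person_team has the date_joined column used in your subquery):

select person, team as last_team
from (
    select *,
           row_number() over (partition by person order by date_joined desc) as rn
    from person_team
)
where rn = 1;

Unlike last(team), which depends on a row ordering that group by does not guarantee, the row numbering here is explicit, so the result is deterministic.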
Let's say I have a table with the following fields:
customerid, transactiontime, transactiontype
I want to group a customer's transactions by time, and select the customerid and the count of those transactions. But rather than simply grouping all transaction times into fixed increments (15 min, 30 min, etc.), for which I've seen various solutions here, I'd like to group a set of a customer's transactions based on how soon each transaction occurs after the previous one.
In other words, if any transaction occurs more than 15 minutes after a previous transaction, I'd like it to be grouped separately.
I expect a customer to generate a few transactions close together, and potentially generate a few more later in the day. So if those two sets of transactions occur more than 15 or 30 minutes apart, they'll be grouped into separate windows. Is this possible?
Yes, you can do this using a window function in SQLite. This syntax is a bit new to me, but this is how it would start:
select customer_id,
       event_start_minute,
       sum(subgroup_start) over (order by customer_id, event_start_minute) as subgroup
from (
    select customer_id,
           event_start_minute,
           case
               when event_start_minute - lag(event_start_minute) over win > 15
               then 1
               else 0
           end as subgroup_start
    from t1
    window win as (
        partition by customer_id
        order by event_start_minute
    )
) as groups
order by customer_id, event_start_minute
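To get the counts the question asks for, you can wrap this in one more level and group on the computed subgroup. A sketch with CTEs (untested; same t1 table and column names assumed, and I've partitioned the running sum by customer so the subgroup numbers restart per customer):

with starts as (
    select customer_id,
           event_start_minute,
           case
               when event_start_minute - lag(event_start_minute)
                    over (partition by customer_id order by event_start_minute) > 15
               then 1
               else 0
           end as subgroup_start
    from t1
),
sessions as (
    select customer_id,
           event_start_minute,
           sum(subgroup_start) over (partition by customer_id
                                     order by event_start_minute) as subgroup
    from starts
)
select customer_id, subgroup, count(*) as transaction_count
from sessions
group by customer_id, subgroup
order by customer_id, subgroup;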
I have a table as follows:
CREATE TABLE someTable (
    user_id uuid,
    id uuid,
    someField text,
    anotherField text,
    PRIMARY KEY (user_id, id)
);
I know that there's a way to do paging in cassandra (https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/)
However, what I need to do is:
page through the entire table (it's large, so paging is required)
get all rows of a user_id
do something with these rows.
In short, I need to fetch all the results of one user and do this for every user there is. (No, I don't have a unique list of user_ids here.)
Also, I know I could do this programmatically: page through all the pages, assume I get the results ordered by user_id, and carry the last user_id (where the rows are cut off) over to the next page of results so that user's data ends up in the same set.
However, I was hoping there would be a more elegant solution for this?
However, what I need to do is:
page through the entire table (it's large, so paging is required).
Assuming you don't know the **user_id** values and want to fetch all users' data: use the token function to make a range query that walks the partition keys. Displaying rows from an unordered partitioner with the TOKEN function looks something like select * from someTable where token(user_id) > token(other_id);
get all rows of a user_id
Now you know a user_id and want to fetch all the rows of that user_id. Use a range query on id, starting from the smallest possible uuid (MIN_UUID below is a placeholder for it), like:
select * from someTable where user_id = 123 and id > MIN_UUID limit 100
After that query, take the 100th uuid and use it to fetch the next rows, such that:
select * from someTable where user_id = 123 and id > [previous_quries_100th_id(uuid)] limit 100
Keep querying until you have fetched all the rows.
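Putting it together, the walk over the whole table can be sketched in CQL like this (angle brackets are placeholders; DISTINCT on the partition key avoids reading every clustering row just to discover user_ids):

-- first batch of partition keys
SELECT DISTINCT user_id FROM someTable LIMIT 100;

-- subsequent batches, restarting from the last user_id seen
SELECT DISTINCT user_id FROM someTable
WHERE token(user_id) > token(<last_user_id_seen>) LIMIT 100;

-- for each user_id found, fetch all of that user's rows
SELECT * FROM someTable WHERE user_id = <that_user_id>;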
do something with these rows.
That depends on what you want to do with those rows. Use your language's ResultSet and iterate over the rows to do whatever you need.
I have a table in MS ACCESS 2013 that looks like this:
Id    Department    Status       FollowingDept  ActualArea
1000  Thinkerers    Thinking     Thinkerer      Thinkerer
1000  Drawers       OnDrawBoard  Drawers        Drawers
1000  MaterialPlan  To Plan      MaterialPlan   MaterialPlan
1000  Painters      MatNeeded    MaterialPlan
1000  Builders      DrawsNeeded  Drawers
The table tracks an ID which has to pass through five departments, each department with at least 5 different statuses.
Each status has a FollowingDept value; for example, *Department* Thinkerers has the status MoreCoffeeNow, which means *FollowingDept* Drawers.
All columns except ActualArea get their values from the feed of a query.
ActualArea is an Expr where I inserted this logic:
Iif(FollowingDept = Department, FollowingDept, "")
My logic is simple: if FollowingDept and Department coincide, then the ID's ActualArea gets the value from FollowingDept.
But as you can see, there can be rare cases like my example above, where 3 departments coincide with the FollowingDept. These cases are rare, but I would like to add something like a priority to Access.
Thinkerers has the top priority, then MaterialPlan, then Drawers, then Builders, and lastly Painters. So, following the same example, after ActualArea gets 3 values, Access should execute another query or subquery (or whatever) that evaluates each value's priority and only leaves behind the one with the top priority. In this example, Thinkerers has the top priority, so the other two values are eliminated from the ActualArea column.
Please keep in mind there are over 500 different IDs, and each ID is repeated 5 times, so there will be about 2500 records to evaluate.
You need another table with the possible values for actualArea and their priorities as numbers; then you can select with a JOIN and order on the priority:
SELECT TOP 1 d.*, p.priority
FROM departments d
LEFT JOIN priorities p ON d.actualArea = p.dept
WHERE d.id = 1000
AND p.priority IS NOT NULL
ORDER BY p.priority ASC
The IS NOT NULL condition eliminates all of the rows where actualArea is empty (those have no match in priorities). The TOP 1 leaves only the row with the top priority.
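For reference, the priorities table I'm assuming looks something like this (my own names; adjust to your schema). The numbers encode Thinkerers first, Painters last:

CREATE TABLE priorities (
    dept TEXT(50),
    priority INTEGER
);

INSERT INTO priorities (dept, priority) VALUES ('Thinkerers', 1);
INSERT INTO priorities (dept, priority) VALUES ('MaterialPlan', 2);
INSERT INTO priorities (dept, priority) VALUES ('Drawers', 3);
INSERT INTO priorities (dept, priority) VALUES ('Builders', 4);
INSERT INTO priorities (dept, priority) VALUES ('Painters', 5);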
You don't seem to have a primary key on your table. If you don't, I'll give another query in a moment, but I would strongly advise you to go back and add a primary key to the table; it will save you an incredible amount of headache later. I added such a key to my test table, called pID. This query makes use of that pID to remove the records you need:
DELETE FROM departments WHERE id = 1000 AND pID NOT IN (
    SELECT TOP 1 d.pID
    FROM departments d
    LEFT JOIN priorities p ON d.actualArea = p.dept
    WHERE id = 1000
      AND p.priority IS NOT NULL
    ORDER BY p.priority ASC
)
If you can't add a primary key to the data and actualArea is assumed to be unique, then you can just use the actualArea values to perform the delete:
DELETE FROM departments WHERE actualArea NOT IN (
    SELECT TOP 1 d.actualArea
    FROM departments d
    LEFT JOIN priorities p ON d.actualArea = p.dept
    WHERE id = 1000
      AND p.priority IS NOT NULL
    ORDER BY p.priority ASC
) AND id = 1000
If actualArea is not going to be unique, then we'll need to revisit this answer. This answer also assumes that you already have the id number.
How do I find the fifth order for each customer and return its title_order, or null if the customer doesn't have a fifth order?
Tables are
customer with columns Id, firstname, lastname...
order with columns order_id, title_order, id_custmer, date...
Can this be done with just a query, or do I need to create a function?
Thanks in advance
You can use OUTER APPLY with OFFSET-FETCH:
select c.firstname, oa.title_order
from customer c
outer apply (
    select title_order
    from [order] o
    where o.id_custmer = c.Id
    order by date
    offset 4 rows fetch next 1 row only
) oa
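If you prefer, an equivalent sketch with ROW_NUMBER and a LEFT JOIN (untested; order is a reserved word in SQL Server, hence the brackets) also returns NULL when a customer has no fifth order:

select c.firstname, o.title_order
from customer c
left join (
    select id_custmer, title_order,
           row_number() over (partition by id_custmer order by date) as rn
    from [order]
) o on o.id_custmer = c.Id and o.rn = 5;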