Duplicated ORDER BY Data and Conditions In Query Every Time Called - jooq

I've noticed this before and long ago stopped having my Conditions be private static finals, but I'm scratching my head on this one because it's a bit more problematic, especially if I'm debugging a query. Here's the example (and I've seen this before in 3.3.x, although I'm currently at 3.7.3):
final SelectJoinStep<Record13<String, String, String, String, String, BigDecimal, BigDecimal, BigDecimal, String, BigDecimal, String, String, Byte>> query = getSelect()
.from(getFrom(coConditions,
ConditionUtils.buildCondition(cortConditions, removeCortBySpeciality),
ConditionUtils.buildCondition(cosConditions, removeCosBySensitivity), tuConditions,
uaConditions, enableBlackMajik));
final SortField<?>[] orders = new SortField[] {DSL.inline(Integer.valueOf(2)).asc(),
DSL.inline(Integer.valueOf(1)).asc(), DSL.inline(Integer.valueOf(6)).asc()};
if (cosConditions.isPresent()) {
User.logger.error(builder.renderInlined(query.where(cosConditions.get()).orderBy(orders)));
return query.where(cosConditions.get()).orderBy(orders);
}
User.logger.error(builder.renderInlined(query.orderBy(orders)));
return query.orderBy(orders);
Here's the SQL snippet from the logger call just showing the ORDER BY:
order by
2 asc,
1 asc,
6 asc
And here's the SQL snippet of the ORDER BY that is sent to the SQL server:
order by
2 asc,
1 asc,
6 asc,
2 asc,
1 asc,
6 asc
Now, to further show the fun, here's another code snippet written just to demonstrate the problem:
User.logger.error(builder.renderInlined(query.orderBy(orders)));
User.logger.error(builder.renderInlined(query.orderBy(orders)));
User.logger.error(builder.renderInlined(query.orderBy(orders)));
return query.orderBy(orders);
First logger call:
order by
2 asc,
1 asc,
6 asc
Second logger call:
order by
2 asc,
1 asc,
6 asc,
2 asc,
1 asc,
6 asc
Third logger call:
order by
2 asc,
1 asc,
6 asc,
2 asc,
1 asc,
6 asc,
2 asc,
1 asc,
6 asc
What the DB sees:
order by
2 asc,
1 asc,
6 asc,
2 asc,
1 asc,
6 asc,
2 asc,
1 asc,
6 asc,
2 asc,
1 asc,
6 asc
I have noticed this kind of behaviour before with my Conditions: each time I reference a Condition it is replicated in the query, to the point where I now build my conditions and reference each one only once (static ones made for some fun). Does anyone know why I'm seeing this behaviour, both here and with Conditions?

This is due to an API design flaw that jOOQ has been carrying around for quite a while, and it will only be fixed in jOOQ 4.0 (#2198).
In general, you cannot safely assume that the DSL API is immutable (although it should be). Each of your consecutive calls to orderBy() adds the ORDER BY column set to the same query again, but you only print the result of the first call, so you don't see the duplication in the log.
The current behaviour is explained here (scroll to "mutability"):
http://www.jooq.org/doc/latest/manual/sql-building/sql-statements/dsl-and-non-dsl
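To make the effect concrete, here is a minimal sketch in plain Java. It is not jOOQ code: ToyQuery and MutableBuilderDemo are hypothetical names, and the toy builder only imitates the "append and return this" behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of a mutable query builder; this is NOT jOOQ code.
class ToyQuery {
    private final List<String> orderBy = new ArrayList<>();

    // Like a mutable DSL step: appends to internal state and returns the same object.
    ToyQuery orderBy(String... fields) {
        for (String field : fields) {
            orderBy.add(field);
        }
        return this;
    }

    String render() {
        return "order by " + String.join(", ", orderBy);
    }
}

class MutableBuilderDemo {
    // Problem pattern: the call made for logging mutates the query,
    // and the call made for the return value mutates it again.
    static String renderAfterRepeatedCalls() {
        ToyQuery query = new ToyQuery();
        query.orderBy("2 asc", "1 asc", "6 asc").render(); // rendered for the log
        return query.orderBy("2 asc", "1 asc", "6 asc").render(); // what is executed
    }

    // Safer pattern: apply orderBy once, then reuse the same reference.
    static String renderAfterSingleCall() {
        ToyQuery ordered = new ToyQuery().orderBy("2 asc", "1 asc", "6 asc");
        ordered.render(); // rendering for the log does not change the query
        return ordered.render();
    }

    public static void main(String[] args) {
        System.out.println(renderAfterRepeatedCalls());
        System.out.println(renderAfterSingleCall());
    }
}
```

Applied to the code in the question, the same idea would be to call query.orderBy(orders) once, keep the result in a local variable, and pass that variable both to builder.renderInlined(...) and to the return statement.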

Related

Cassandra where clause as a tuple

Table12
CustomerId CampaignID
1 1
1 2
2 3
1 3
4 2
4 4
5 5
val CustomerToCampaign = ((1,1),(1,2),(2,3),(1,3),(4,2),(4,4),(5,5))
Is it possible to write a query like
select CustomerId, CampaignID from Table12 where (CustomerId, CampaignID) in (CustomerToCampaign_1, CustomerToCampaign_2)
???
So the input is a list of tuples, but the table stores the values in individual columns rather than in a tuple column.
Sure, it's possible. But only on the clustering keys. That means I need to use something else as a partition key or "bucket." For this example, I'll assume that marketing campaigns are time sensitive and that we'll get a good distribution and ease of querying by using "month" as the bucket (partition).
CREATE TABLE stackoverflow.customertocampaign (
campaign_month int,
customer_id int,
campaign_id int,
customer_name text,
PRIMARY KEY (campaign_month, customer_id, campaign_id)
);
Now, I can INSERT the data described in your CustomerToCampaign variable. Then, this query works:
aploetz@cqlsh:stackoverflow> SELECT campaign_month, customer_id, campaign_id
FROM customertocampaign WHERE campaign_month=202004
AND (customer_id,campaign_id) = (1,2);
campaign_month | customer_id | campaign_id
----------------+-------------+-------------
202004 | 1 | 2
(1 rows)

Cassandra query max of a particular column for a particular ID

I am trying to write a Cassandra query and my use case is as follows
Let's say the table is
ID | Version
1 | 1
1 | 2
2 | 1
2 | 2
2 | 3
What I want is the latest version for each ID.
So the query should give me 2 rows: the first with ID 1, Version 2 and the second with ID 2, Version 3.
I tried a query like Select * from table where ID=1 and Version= MAX(Version), but it's not valid syntax.
Can anybody help with this?
SELECT * FROM table WHERE ID = 1 LIMIT 1 would give you the highest version if your clustering key is Version ordered by descending.
CREATE TABLE table (
id int,
version int,
PRIMARY KEY (id, version)
) WITH CLUSTERING ORDER BY (version DESC);
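As a rough in-memory illustration of why LIMIT 1 returns the highest version, here is a small Java sketch. It only models the ordering idea (one sorted set per partition key, sorted descending) and is not how Cassandra is implemented; the class name LatestVersionDemo is made up for this example.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory stand-in for the table: partition key (id) -> versions kept in descending order.
class LatestVersionDemo {
    private final Map<Integer, TreeSet<Integer>> partitions = new TreeMap<>();

    void insert(int id, int version) {
        partitions
            .computeIfAbsent(id, k -> new TreeSet<Integer>(Comparator.<Integer>reverseOrder()))
            .add(version);
    }

    // Mirrors "WHERE id = ? LIMIT 1": the first stored row is the highest version.
    Integer latestVersion(int id) {
        TreeSet<Integer> versions = partitions.get(id);
        return versions == null ? null : versions.first();
    }

    // Loads the rows from the question.
    static LatestVersionDemo sample() {
        LatestVersionDemo table = new LatestVersionDemo();
        table.insert(1, 1);
        table.insert(1, 2);
        table.insert(2, 1);
        table.insert(2, 2);
        table.insert(2, 3);
        return table;
    }

    public static void main(String[] args) {
        LatestVersionDemo table = sample();
        System.out.println("id 1 -> version " + table.latestVersion(1));
        System.out.println("id 2 -> version " + table.latestVersion(2));
    }
}
```

Note that this answers the question one ID at a time: the query is run once per ID, which matches the WHERE ID = 1 form above.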

Cassandra select order by

I created a table like this:
CREATE TABLE sm.data (
did int,
tid int,
ts timestamp,
aval text,
dval decimal,
PRIMARY KEY (did, tid, ts)
) WITH CLUSTERING ORDER BY (tid ASC, ts DESC);
Until now all my select queries used ts DESC, so this worked well. Now I also need select queries with ts ASC in some cases. How do I accomplish that? Thank you.
You can simply use ORDER BY ts ASC
Example :
SELECT * FROM data WHERE did = ? and tid = ? ORDER BY ts ASC
If you run this select
select * from data where did=1 and tid=2 order by ts asc;
you will get an error:
InvalidRequest: Error from server: code=2200 [Invalid query] message="Order by currently only support the ordering of columns following their declared order in the PRIMARY KEY"
I tested this against my local Cassandra database.
I would suggest altering the order of the primary key columns.
The reason is this:
"Querying compound primary keys and sorting results: ORDER BY clauses can select a single column only. That column has to be the second column in a compound PRIMARY KEY."
CREATE TABLE data2 (
did int,
tid int,
ts timestamp,
aval text,
dval decimal,
PRIMARY KEY (did, ts, tid)
) WITH CLUSTERING ORDER BY (ts DESC, tid ASC);
Now we are free to choose the ordering for ts:
cassandra@cqlsh:airline> SELECT * FROM data2 WHERE did = 1 and ts=2 order by ts DESC;
did | ts | tid | aval | dval
-----+----+-----+------+------
(0 rows)
cassandra@cqlsh:airline> SELECT * FROM data2 WHERE did = 1 and ts=2 order by ts ASC;
did | ts | tid | aval | dval
-----+----+-----+------+------
(0 rows)
Another way would be to create either a new table or a materialized view; the latter would lead to data duplication behind the scenes anyway.
I hope that is clear enough.

As far as I know, in range queries Cassandra returns results ordered by clustering key. Can I change this behavior in my query?

I'm trying to store and retrieve the last active sensors using this schema:
CREATE TABLE last_signals (
section bigint,
sensor bigint,
time bigint,
PRIMARY KEY (section, sensor)
);
Rows of this table will be updated every few seconds, and as a result hot sensors will remain in the memtable. But what will happen when I run a query like this:
SELECT * FROM last_signals
WHERE section = ? AND time > ?
Limit ?
ALLOW FILTERING;
And the result will be something like this (Ordered by clustering key):
sect | sens | time
------+------+------
1 | 1 | 4
1 | 2 | 3
1 | 4 | 2
1 | 5 | 9
The first question: is this result guaranteed to be the same in all versions (I'm using 3.7)? The next one: how can I change this behavior (with a query option, modeling, etc.)? I need to get the last writes first, regardless of clustering key order. I think my reads will be much faster in that case.
I don't think there is any way to guarantee order besides using clustering keys. Also, your ALLOW FILTERING query is potentially costly and may even time out. You could consider the following schema:
CREATE TABLE last_signals_by_time (
section bigint,
sensor bigint,
time bigint,
dummy bool,
PRIMARY KEY ((section, sensor), time)
) WITH CLUSTERING ORDER BY (time DESC);
Instead of updates, do inserts with a TTL so that you do not have to clean up old entries manually. (The dummy field is needed in order for TTL to work.)
And then just run your read queries per section/sensors in parallel:
SELECT * FROM last_signals_by_time
WHERE section = ? AND sensor = ?
LIMIT 1;
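A minimal Java sketch of the "run the reads in parallel" step, assuming the client already knows the sensor IDs for a section. The fetchLatest function is a hypothetical stand-in for executing the per-sensor LIMIT 1 query through whatever driver you use; no real driver API is shown here, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiFunction;

class ParallelLatestSignals {
    // One row of last_signals_by_time (the dummy column is omitted here).
    static final class Signal {
        final long section;
        final long sensor;
        final long time;

        Signal(long section, long sensor, long time) {
            this.section = section;
            this.sensor = sensor;
            this.time = time;
        }
    }

    // fetchLatest is a hypothetical stand-in for the per-sensor "LIMIT 1" read.
    static List<Signal> latestFirst(long section, List<Long> sensors,
            BiFunction<Long, Long, Optional<Signal>> fetchLatest) {
        // Start one asynchronous read per sensor.
        List<CompletableFuture<Optional<Signal>>> futures = new ArrayList<>();
        for (Long sensor : sensors) {
            futures.add(CompletableFuture.supplyAsync(() -> fetchLatest.apply(section, sensor)));
        }
        // Wait for all reads and keep the sensors that returned a row.
        List<Signal> result = new ArrayList<>();
        for (CompletableFuture<Optional<Signal>> future : futures) {
            future.join().ifPresent(result::add);
        }
        // Order on the client: most recent write first.
        result.sort(Comparator.comparingLong((Signal s) -> s.time).reversed());
        return result;
    }

    // Uses the sample rows from the question with a map in place of the database.
    static List<Long> sampleSensorOrder() {
        Map<Long, Long> lastTimeBySensor = new HashMap<>();
        lastTimeBySensor.put(1L, 4L);
        lastTimeBySensor.put(2L, 3L);
        lastTimeBySensor.put(4L, 2L);
        lastTimeBySensor.put(5L, 9L);
        List<Signal> rows = latestFirst(1L, Arrays.asList(1L, 2L, 4L, 5L),
            (sec, sensor) -> Optional.ofNullable(lastTimeBySensor.get(sensor))
                                     .map(t -> new Signal(sec, sensor, t)));
        List<Long> order = new ArrayList<>();
        for (Signal s : rows) {
            order.add(s.sensor);
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(sampleSensorOrder());
    }
}
```

With the sample data from the question, the sensors come back as 5, 1, 2, 4, that is, ordered by last write time rather than by clustering key.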

Order latest records by timestamp in Cassandra

I'm trying to display the latest values from a list of sensors. The list should also be sortable by the time-stamp.
I tried two different approaches. I included the update time of the sensor in the primary key:
CREATE TABLE sensors (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, changedate)
) WITH CLUSTERING ORDER BY (changedate DESC);
Then I can select the list like this:
select * from sensors where customerid=0 order by changedate desc;
which results in this:
customerid | changedate | sensorid | value
------------+--------------------------+----------+-------
0 | 2015-07-10 12:46:53+0000 | 1 | 2
0 | 2015-07-10 12:46:52+0000 | 1 | 1
0 | 2015-07-10 12:46:52+0000 | 0 | 2
0 | 2015-07-10 12:46:26+0000 | 0 | 1
The problem is, I don't get only the latest results, but all the old values too.
If I remove the changedate from the primary key, the select fails altogether.
InvalidRequest: code=2200 [Invalid query] message="Order by is currently only supported on the clustered columns of the PRIMARY KEY, got changedate"
Updating the sensor values is also no option:
update overview set changedate=unixTimestampOf(now()), value = '5' where customerid=0 and sensorid=0;
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY part changedate found in SET part"
This fails because changedate is part of the primary key.
Is there any possible way to store only the latest values from each sensor and also keep the table ordered by the time-stamp?
Edit:
In the meantime I tried another approach, to store only the latest value.
I used this schema:
CREATE TABLE sensors (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, sensorid, changedate)
) WITH CLUSTERING ORDER BY (changedate DESC);
Before inserting the latest value, I would delete all old values
DELETE FROM sensors WHERE customerid=? and sensorid=?;
But this fails because changedate is NOT part of the WHERE clause.
"The problem is, I don't get only the latest results, but all the old values too."
Since you are storing in a CLUSTERING ORDER of DESC, it will always be very easy to get the latest records. All you need to do is add 'LIMIT' to your query, i.e.:
select * from sensors where customerid=0 order by changedate desc limit 10;
This would return at most 10 records with the highest changedate. Even though you are using LIMIT, you are still guaranteed to get the latest records, since your data is ordered that way.
"If I remove the changedate from the primary key, the select fails altogether."
This is because you cannot order on a column that is not a clustering key (the part of the primary key after the partition key), except maybe with a secondary index, which I would not recommend.
"Updating the sensor values is also no option"
Your update query is failing because it is not legal to include part of the primary key in 'set'. To make this work, all you need to do is update your query to include changedate in the where clause, i.e.:
update overview set value = '5', sensorid = 0 where customerid=0 and changedate=unixTimestampOf(now())
"Is there any possible way to store only the latest values from each sensor and also keep the table ordered by the time-stamp?"
You can do this by creating a separate table named 'latest_sensor_data' with the same table definition except for the primary key. The primary key will now be 'customerid, sensorid', so you can only have 1 record per sensor. The process of creating separate tables is called denormalization and is a common pattern, particularly in Cassandra data modeling. When you insert sensor data you would now insert data into both 'sensors' and 'latest_sensor_data'.
CREATE TABLE latest_sensor_data (
customerid int,
sensorid int,
changedate timestamp,
value text,
PRIMARY KEY (customerid, sensorid)
);
In Cassandra 3.0, 'materialized views' will be introduced, which will make this unnecessary, as you can use a materialized view to accomplish this for you.
Now doing the following query:
select * from latest_sensor_data where customerid=0
Will give you the latest value for every sensor for that customer.
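Here is a small in-memory Java sketch of that dual-write idea, using the sample rows from the question. It is only a model: a list stands in for 'sensors', a map keyed by (customerid, sensorid) stands in for 'latest_sensor_data', and it assumes writes arrive in chronological order. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SensorTablesDemo {
    static final class Reading {
        final int customerId;
        final int sensorId;
        final String changeDate;
        final String value;

        Reading(int customerId, int sensorId, String changeDate, String value) {
            this.customerId = customerId;
            this.sensorId = sensorId;
            this.changeDate = changeDate;
            this.value = value;
        }
    }

    // Stand-in for the history table: every write adds a row.
    private final List<Reading> sensorHistory = new ArrayList<>();
    // Stand-in for latest_sensor_data: the key is (customerid, sensorid),
    // so a new write for the same sensor replaces the previous row.
    private final Map<String, Reading> latestSensorData = new LinkedHashMap<>();

    // Every incoming reading is written to both "tables".
    void record(Reading reading) {
        sensorHistory.add(reading);
        latestSensorData.put(reading.customerId + ":" + reading.sensorId, reading);
    }

    int historySize() {
        return sensorHistory.size();
    }

    // Mirrors "select * from latest_sensor_data where customerid = ?".
    List<Reading> latestFor(int customerId) {
        List<Reading> rows = new ArrayList<>();
        for (Reading reading : latestSensorData.values()) {
            if (reading.customerId == customerId) {
                rows.add(reading);
            }
        }
        return rows;
    }

    // Replays the four rows from the question, oldest first.
    static SensorTablesDemo sample() {
        SensorTablesDemo tables = new SensorTablesDemo();
        tables.record(new Reading(0, 0, "2015-07-10 12:46:26+0000", "1"));
        tables.record(new Reading(0, 0, "2015-07-10 12:46:52+0000", "2"));
        tables.record(new Reading(0, 1, "2015-07-10 12:46:52+0000", "1"));
        tables.record(new Reading(0, 1, "2015-07-10 12:46:53+0000", "2"));
        return tables;
    }

    public static void main(String[] args) {
        SensorTablesDemo tables = sample();
        System.out.println("history rows: " + tables.historySize());
        for (Reading reading : tables.latestFor(0)) {
            System.out.println(reading.sensorId + " | " + reading.changeDate + " | " + reading.value);
        }
    }
}
```

The history keeps all four rows, while the latest table ends up with one row per sensor.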
I would recommend renaming 'sensors' to 'sensor_data' or 'sensor_history' to make it more clear what the data is. Additionally you should change the primary key to 'customerid, changedate, sensorid' as that would allow you to have multiple sensors at the same date (which seems possible).
Your first approach looks reasonable. If you add "limit 1" to your query, you would only get the latest result, or limit 2 to see the latest 2 results, etc.
If you want to automatically remove old values from the table, you can specify a TTL (Time To Live) for data points when you do the insert. So if you wanted to keep data points for 10 days, you could do this by adding "USING TTL 864000" on your insert statements. Or you could set a default TTL for the entire table.
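For reference, the 864000 figure is just 10 days expressed in seconds. A short Java sketch of the arithmetic, which also builds the insert text with the TTL appended; the column list comes from the 'sensors' table in the question, and the class name TtlDemo is made up.

```java
import java.time.Duration;

class TtlDemo {
    // TTL values are given in seconds.
    static long ttlSeconds(int days) {
        return Duration.ofDays(days).getSeconds();
    }

    // Builds the CQL text of an insert into the question's table with a TTL clause.
    static String insertWithTtl(int days) {
        return "INSERT INTO sensors (customerid, sensorid, changedate, value) "
             + "VALUES (?, ?, ?, ?) USING TTL " + ttlSeconds(days);
    }

    public static void main(String[] args) {
        System.out.println(ttlSeconds(10));
        System.out.println(insertWithTtl(10));
    }
}
```

For the table-wide default mentioned above, the relevant table option is default_time_to_live, also in seconds.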

Resources