Difference between fetch().stream().findFirst() vs fetchAnyInto - jooq

Can someone let me know what is the difference between the below queries? If both are same which one should be preferred over the other?
Query 1
dslContext
.selectFrom(AIR_TRANSACTION)
.where(
AIR_TRANSACTION
.TICKET_NUMBER
.eq(ticketNumber)
.and(AIR_TRANSACTION.SPOTNANA_PNR_ID.eq(pnrId)))
.orderBy(AIR_TRANSACTION.TRANSACTION_DATE.desc())
.fetch()
.stream()
.findFirst()
.map(r -> r.into(AirTransaction.class));
Query 2
Optional.ofNullable(
dslContext
.selectFrom(AIR_TRANSACTION)
.where(
AIR_TRANSACTION
.TICKET_NUMBER
.eq(ticketNumber)
.and(AIR_TRANSACTION.SPOTNANA_PNR_ID.eq(pnrId)))
.orderBy(AIR_TRANSACTION.TRANSACTION_DATE.desc())
.fetchAnyInto(AirTransaction.class));
JOOQ Version - 3.17.7
Postgres Version - 14.6

The two are the same, logically, but not necessarily from a performance perspective. Think about what each operation does:
Query 1
.fetch() fetches all data from the JDBC ResultSet into memory as a Result<AirTransactionRecord> (this includes going through the ExecuteListener::recordStart lifecycle for every record!)
.stream() is lazy, implemented by ArrayList::stream and doesn't do anything on its own
.findFirst() is a terminal operation on the stream that discards all elements after the first
.map() maps 0..1 AirTransactionRecord into the AirTransaction POJO
Query 2
fetchAnyInto() fetches only 0..1 elements from the JDBC ResultSet into memory as AirTransactionRecord, then, maps it into the AirTransaction POJO. (this only includes going through the ExecuteListener::recordStart lifecycle for that single record)
In the future, the intermediate AirTransactionRecord might even be skipped, as it might not be necessary, see #6737
The difference
The difference only manifests when there are more than 1 records in the ResultSet, in case of which Query 2 is much faster than Query 1. This is unlikely what you want, so in such a case, it would be a good idea to add a limit(1) clause to prevent more than 1 records to be sent from SQL.
Other than that, the two approaches are effectively equivalent.
Alternative
There's also .fetchOptionalInto(AirTransaction.class), which you could call, to skip having to manually wrap things in an optional. The semantics of that is slightly different in case the query produces multiple records, though, as it may throw a TooManyRowsException

Related

Best way to Fetch N rows in ScyllaDB

I have a use case in which I utilize ScyllaDB to limit users' actions in the past 24h. Let's say the user is only allowed to make an order 3 times in the last 24h. I am using ScyllaDB's ttl and making a count on the number of records in the table to achieve this. I am also using https://github.com/spaolacci/murmur3 to get the hash for the partition key.
However, I would like to know what is the most efficient way to query the table. So I have a few queries in which I'd like to understand better and compare the behavior(please correct me if any of my statement is wrong):
using count()
count() will implement a full-scan query, meaning that it may query more than necessary records into the table.
SELECT COUNT(1) FROM orders WHERE hash_id=? AND user_id=?;
using limit
limit will only limit the number of records being returned to the client. Meaning it will still query all records that match its predicates but only limit the ones returned.
SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?;
using paging
I'm a bit new to this, but if I read the docs correctly it should only query the up until it received the first N records without having to query the whole table. So if I limit the page size to a number of records I want to fetch and only query the first page, would it work correctly? and will it have a consistent result?
docs: https://java-driver.docs.scylladb.com/stable/manual/core/paging/index.html
my query is still using limit, but utilizing the driver to achieve this with https://github.com/gocql/gocql
iter := conn.Query(
"SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?",
hashID,
userID,3
).PageSize(3).PageState(nil).Iter()
Please let me know if my analysis was correct and which method would be best to choose
Your client should always use paging - otherwise you risk adding pressure to the query coordinator, which may introduce latency and memory fragmentation. If you use the Scylla Monitoring stack (and you should if you don't!), refer to the CQL Optimization dashboard and - more specifically - to the Paged Queries panel.
Now, to your question. It seems to be that your example is a bit minimalist for what you are actually wanting to achieve and - even then - should it not be, we have to consider such set-up at scale. Eg: There may be a tenant allowed which is allowed to place 3 orders within a day, but another tenant allowed to place 1 million orders within a week?
If the above assumption is correct - and with the options at hand you have given - you are better off using LIMIT with paging. The reason is because there are some particular problems with the description you've given at hand:
First, you want to retrieve N amount of records within a particular time-frame, but your queries don't specify such time-frame
Second, either COUNT or LIMIT will initiate a partition scan, and it is not clear how a hash_id + user_id combination can be done to determine the number of records within a time-frame.
Of course, it may be that I am wrong, but I'd like to suggest different some approaches which may be or not applicable for you and your use case.
Consider a timestamp component part of the clustering key. This will allow you to avoid full partition scans, with queries such as:
SELECT something FROM orders WHERE hash_id=? AND user_id=? AND ts >= ? AND ts < ?;
If the above is not applicable, then perhaps a Counter Table would suffice your needs? You could simply increment a counter after an order is placed, and - after - query the counter table as in:
SELECT count FROM counter_table WHERE hash_id=? AND user_id=? AND date=?;
I hope that helps!
I have a few points I want to add to what Felipe wrote already:
First, you don't need to hash the partition key yourself. You can use anything you want for the partition key, even consecutive numbers, the partition key doesn't need to be random-looking. Scylla will internally hash the partition key on its own to improve the load balancing. You don't need to know or care which hashing algorithm ScyllaDB uses, but interestingly, it's a variant of murmur3 too (which is not identical to the one you used - it's a modified algorithm originally picked by the Cassandra developers).
Second, you should know - and decide whether you care - that the limit you are trying to enforce is not a hard limit when faced with concurrent operations: Imagine that the given partition already has two records - and now two concurrent record addition requests come in. Both can check that there are just two records, decide it's fine to add the third - and then when both add their record - and you end up with four records. You'll need to decide whether this is fine for you that a user can get in 4 requests in a day if they are lucky, or it's a disaster. Note that theoretically you can get even more than 4 - if the user managest to send N requests at exactly the same time, they may be able to get 2+N records in the database (but in the usual case, they won't manage to get many superflous records). If you'll want 3 to be a hard limit, you'll probably needs to change your solution - perhaps to one based on LWT and not use TTL.
Third, I want to note that there is not an important performance difference between COUNT and LIMIT when you know a-priori that there will only be up to 3 (or perhaps, as explained above, 4 or some other similarly small number) results. If you assume that the SELECT only yields three or less results, and it can never be a thousand results, then it doesn't really matter if you just retrieve them or count them - you should just do whichever is convenient for you. In any case, I think that paging is not a good solution your need. For such short results and you can just use the default page size and you'll never reach it anyway, and also paging hints the server that you will likely continue reading on the next page - and it caches the buffers it needs to do that - while in this case you know that you'll never continue after the first three results. So in short, don't use any special paging setup here - just use the default page size (which is 1MB) and it will never be reached anyway.

Inconsistent results when sorting documents by _doc

I want to fetch elasticsearch hits using the sort+search_after paging mechanism.
The elasticsearch documentation states:
_doc has no real use-case besides being the most efficient sort order. So if you don’t care about the order in which documents are returned, then you should sort by _doc. This especially helps when scrolling.
However, when performing the same query multiple times, I get different results. More specifically, the first hit alternates randomly between two different hits, where the returned sort field is 0 for one hit, and some specific number for the other.
This obviously breaks the paging as it relies on the value returned in sorting to be later fed into sort_after for the next query.
No data is being written to the index while I am querying it, so this is not because of refreshes.
My questions are therefore:
Is it wrong to sort by _doc for paging? Seems the results I get are inconsistent.
How does sorting by _doc work internally? The documentation is lacking in this regard as it simply states the sort is performed by "index order".
The data was written to the index in parallel using Spark. I thought the problem might have been the parallel write combined with the "index order" sorting, however I did not manage to replicate this behavior with other indicies which were also written to in Spark.
es 7, index contains 2 shards, one primary and one replica
cheers.
The reason this happened is that the index consists of 2 shards. One primary and one replica. The documents were not indexed in the same order. Thus, the order of the results depends on the shard they were returned from. This is fine when using scrolling because Elasticsearch keeps an inner state of the results, but not with paging, which is stateless.

How to resolve ArangoDB Foxx deadlocking problems?

Under testing, our Foxx app is getting into "deadlock detected" problems. These seem to be caused by traversal queries. Apriori, it is difficult if not impossible to know which tables will be used during traversal. However, I did take one specific case where I could determine the number of tables and wrap the AQL in a transaction for testing:
var result = db._executeTransaction({ "collections" : { "read" : [ "pmlibrary", "pmvartype", "pmvariant", "pmproject", "pmsite", "pmpath", "pmattic" ] }, "action" : "function(){var db=require(\"#arangodb\").db;var res=db._query(\"FOR o IN ['pmlibrary/199340787'] FOR v,e,p IN 0..7 INBOUND o pm_child RETURN p.vertices\");return res.toArray()}" });
FYI, the list of tables in the collections do not include the edge tables.
However, the deadlocks on this statement continue. I'm not sure what to try next. Thanks.
I'm not an indexing/performance expert with ArangoDB, but I recently had some issues with deadlocking too and by reconstructing and reshaping the queries had huge performance gains, sometimes queries that took 120 seconds then took 0.2 seconds.
A key thing I did was try to help ArangoDB know when to use an Index, and make sure that index is there to be used.
In your example query, there is a problem that stops ArangoDB from knowing an index occurs:
FOR o IN ['pmlibrary/199340787']
FOR v,e,p IN 0..7 INBOUND o pm_child
RETURN p.vertices
The key issue with this is that your original FOR loop is using a value out of an array, which may hinder ArangoDB identifying an index. You could easily replace your o with a parameter, and put in your 'pmlibrary/199340787' value directly, and with the Explain button see if it identifies it can use an index.
One issue I found was trying to make it super clear to use an index, which sometimes meant adding additional LET commands to allow ArangoDB to build the contents of arrays, and then it seems to know what index to use better.
If you were to test the performance of the query above, versus something like:
LET my_targets = (FOR o IN pmlibrary FILTER o._key == '199340787' RETURN o._id)
FOR o IN my_targets
FOR v,e,p IN 0..7 INBOUND o._id pm_child
RETURN p.vertices
It may seem counter intuitive, but with testing I found amazing performance gains by breaking queries apart to help ArangoDB notice it could use an Index.
If any queries are against the values within Arrays or if you are using dynamic attribute names, then it won't always use an index even if one exists.
Also I found that the server will cache response for queries, so if you want to test changes to a long running query and want to avoid cache hits on your second + queries, I had to stop and restart the arangodb3 service.
I hope that helps give you a target for shuffling around your query to work with indexes.
Remember you already have inbuilt indexes on _id, _key, _from, _to, so you need to try and make sure it's going to use other indexes that can help speed up your query.
The proper solution was to use a WITH clause rather than a transaction.

Is a read with one secondary index faster than a read with multiple in cassandra?

I have this structure that I want a user to see the other user's feeds.
One way of doing it is to fan out an action to all interested parties's feed.
That would result in a query like select from feeds where userid=
otherwise i could avoid writing so much data and since i am already doing a read I could do:
select from feeds where userid IN (list of friends).
is the second one slower? I don't have the application yet to test this with a lot of data/clustering. As the application is big writing code to test a single node is not worth it so I ask for your knowledge.
If your title is correct, and userid is a secondary index, then running a SELECT/WHERE/IN is not even possible. The WHERE/IN clause only works with primary key values. When you use it on a column with a secondary index, you will see something like this:
Bad Request: IN predicates on non-primary-key columns (columnName) is not yet supported
Also, the DataStax CQL3 documentation for SELECT has a section worth reading about using IN:
When not to use IN
The recommendations about when not to use an index apply to using IN
in the WHERE clause. Under most conditions, using IN in the WHERE
clause is not recommended. Using IN can degrade performance because
usually many nodes must be queried. For example, in a single, local
data center cluster with 30 nodes, a replication factor of 3, and a
consistency level of LOCAL_QUORUM, a single key query goes out to two
nodes, but if the query uses the IN condition, the number of nodes
being queried are most likely even higher, up to 20 nodes depending on
where the keys fall in the token range.
As for your first query, it's hard to speculate about performance without knowing about the cardinality of userid in the feeds table. If userid is unique or has a very high number of possible values, then that query will not perform well. On the other hand, if each userid can have several "feeds," then it might do ok.
Remember, Cassandra data modeling is about building your data structures for the expected queries. Sometimes, if you have 3 different queries for the same data, the best plan may be to store that same, redundant data in 3 different tables. And that's ok to do.
I would tackle this problem by writing a table geared toward that specific query. Based on what you have mentioned, I would build it like this:
CREATE TABLE feedsByUserId
userid UUID,
feedid UUID,
action text,
PRIMARY KEY (userid, feedid));
With a composite primary key made up of userid as the partitioning key you will then be able to run your SELECT/WHERE/IN query mentioned above, and achieve the expected results. Of course, I am assuming that the addition of feedid will make the entire key unique. if that is not the case, then you may need to add an additional field to the PRIMARY KEY. My example is also assuming that userid and feedid are version-4 UUIDs. If that is not the case, adjust their types accordingly.

How to implement custom pagination with LINQ2Entities using 1 call to the database

I am working on ASP.NET Web Forms project and I use jquery datatable to visualize data fetched from SQL server. I need to pass the results for the current page and the total number of results for which by far I have this code :
var queryResult = query.Select(p => new[] { p.Id.ToString(),
p.Name,
p.Weight.ToString(),
p.Address })
.Skip(iDisplayStart)
.Take(iDisplayLength).ToArray();
and the result that I get when I return the result to the view like :
iTotalRecords = queryResult.Count(),
is the number of records that the user has chosen to see per page. Logical, but I haven't thought about it while building my Method chaining. Now I think about the optimal way to implement this. Since it's likely to use with relatively big amounts of data (10,000 rows, maybe more) I would like leave as much work as I can to the SQL server. However I found several questions asked about this, and the impression that I get is that I have to make two queries to the database, or manipulate the total result in my code. But I think this will won't be efficient when you have to work with many records.
So what can I do here to get best performance?
In regards to what you’re looking for I don’t think there is a simple answer.
I believe the only way you can currently do this is by running more than one query like you have already suggested, whether this would be encapsulated inside a stored procedure (SPROC) call or generated by EF.
However, I believe you can make optimsations to make your query run quicker.
First of all, every query execution MAY result in the query being recached as you are chaining your methods together, this means that the actual query being executed will need to be recompiled and cached by SQL Server (if that is your chosen technology) before being executed. This normally only takes a few milliseconds, but if the query being executed only takes a few milliseconds then this is relatively expensive.
Entity framework will translate this Linq query and execute it using derived tables. With a small result set of approx. 1k records to be paged your current solution maybe best suited. This would also depend upon on how complex your SQL filtering is as generated by your method chaining.
If your result set to be paged grows up towards 15k, I would suggest writing a SPROC to get the best performance and scalability which would insert the records into a temp table and run two queries against it, firstly to get the paged records, and secondly to get the total results.
alter proc dbo.usp_GetPagedResults
#Skip int = 10,
#Take int = 10
as
begin
select
row_number() over (order by id) [RowNumber],
t.Name,
t.Weight,
t.Address
into
#results
from
dbo.MyTable t
declare #To int = #Skip+#Take-1
select * from #results where RowNumber between #Skip and #To
select max(RowNumber) from #results
end
go
You can use the EF to map a SPROC call to entity types or create a new custom type containing the results and the number of results.
Stored Procedures with Multiple Results
I found that the cost of running the above SPROC was approximately a third of running the query which EF generated to get the same result based upon the result set size of 15k records. It was however three times slower than the EF query if only a 1K record result set due to the temp table creation.
Encapsulating this logic inside a SPROC allows the query to be refactored and optimised as your result set grows without having to change any application based code.
The suggested solution doesn’t use the derived table queries as created by the Entity Framework inside a SPROC as I found there was a marginal performance difference between running the SPROC and the query directly.

Resources