Spring Integration Jdbc OutboundGateway returning 1 record ONLY even with MaxRows(0)

We have an application with some 30,000+ records in a table, and for an analytics use-case we need to fetch them all and iterate over the returned result for some computation. However, the Jdbc OutboundGateway returns ONLY 1 record even with MaxRows(0), although there are 30,000+ records in the DB. It does return n records as a List when we explicitly set MaxRows() to a non-zero value.
How can this be made to return all rows with MaxRows(0)?

That's probably how your JDBC driver works or how your RDBMS is configured for maxRows.
The logic there in JdbcOutboundGateway is like this:
if (this.maxRows != null) {
    Assert.notNull(this.poller, "If you want to set 'maxRows', then you must provide a 'selectQuery'.");
    this.poller.setMaxRows(this.maxRows);
}
where that JdbcPollingChannelAdapter has this logic:
By default it is private int maxRows = 0;
return new PreparedStatementCreatorWithMaxRows(preparedStatementCreator,
        JdbcPollingChannelAdapter.this.maxRows);
And that PreparedStatementCreatorWithMaxRows does:
public PreparedStatement createPreparedStatement(Connection con) throws SQLException {
    PreparedStatement preparedStatement = this.delegate.createPreparedStatement(con);
    preparedStatement.setMaxRows(this.maxRows); // We can't mutate provided JdbcOperations for this option
    return preparedStatement;
}
Then PreparedStatement documents it like this:
/**
 * Sets the limit for the maximum number of rows that any
 * {@code ResultSet} object generated by this {@code Statement}
 * object can contain to the given number.
 * If the limit is exceeded, the excess
 * rows are silently dropped.
 *
 * @param max the new max rows limit; zero means there is no limit
 * @throws SQLException if a database access error occurs,
 *         this method is called on a closed {@code Statement}
 *         or the condition {@code max >= 0} is not satisfied
 * @see #getMaxRows
 */
void setMaxRows(int max) throws SQLException;
zero means there is no limit
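One way to check whether your driver itself is ignoring that contract is to bypass Spring Integration entirely and run the same query through plain JDBC with setMaxRows(0). A minimal sketch; the connection URL, credentials and query are placeholders for your environment:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MaxRowsProbe {

    public static void main(String[] args) throws Exception {
        // Placeholder URL/credentials; point this at the same database the gateway uses.
        try (Connection con = DriverManager.getConnection("jdbc:yourdb://localhost/yourdb", "user", "pass");
             PreparedStatement ps = con.prepareStatement("SELECT * FROM your_table")) {
            ps.setMaxRows(0); // per the JDBC contract, 0 means "no limit"
            int count = 0;
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    count++;
                }
            }
            // If this prints 1 instead of 30,000+, the driver does not honor setMaxRows(0).
            System.out.println("Rows fetched with setMaxRows(0): " + count);
        }
    }
}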
The logic in the JdbcOutboundGateway in the end is like this:
if (list.size() == 1 && (this.maxRows == null || this.maxRows == 1)) {
    payload = list.get(0);
}
So the gateway returns a single record only when the ResultSet itself contains exactly one element.
I doubt anything can be done from the Spring Integration perspective, unless you want to try Integer.MAX_VALUE for this property, since your JDBC communication does not honor the PreparedStatement.setMaxRows() contract.
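For illustration, a minimal sketch of that Integer.MAX_VALUE workaround when wiring the gateway yourself. The channel name, table and query are hypothetical, and it assumes your Spring Integration version exposes setMaxRows(Integer) on the gateway and accepts a null update query when a select query is provided:

import javax.sql.DataSource;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.integration.jdbc.JdbcOutboundGateway;

@Configuration
public class AnalyticsGatewayConfig {

    @Bean
    @ServiceActivator(inputChannel = "analyticsRequestChannel") // hypothetical channel name
    public JdbcOutboundGateway analyticsGateway(DataSource dataSource) {
        // Select-only gateway: null update query, hypothetical select query.
        JdbcOutboundGateway gateway =
                new JdbcOutboundGateway(dataSource, null, "SELECT * FROM analytics_data");
        // Work around a driver that ignores setMaxRows(0): ask for an explicit, very large cap.
        gateway.setMaxRows(Integer.MAX_VALUE);
        return gateway;
    }
}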

Related

jOOQ: Can you force LIMIT to be rendered as ROW_NUMBER() window function?

I see this bit in the jOOQ docs:
https://github.com/jOOQ/jOOQ/blob/d727e6c476e8b1cbed1c91fd3724936c73cd9126/jOOQ/src/main/java/org/jooq/SelectLimitStep.java#L135-L147
/**
 * Add a <code>LIMIT</code> clause to the query
 * <p>
 * If there is no <code>LIMIT</code> or <code>TOP</code> clause in your
 * RDBMS, this may be emulated with a <code>ROW_NUMBER()</code> window
 * function and nested <code>SELECT</code> statements.
 * <p>
 * This is the same as calling {@link #limit(Number, Number)} with offset = 0, or
 * calling <code>.limit(numberOfRows).offset(0)</code>
 */
I'm wondering if there is a setting to force-enable this option?
There seems to be a setting for the opposite, to convert ROW_NUMBER to LIMIT, but not LIMIT to ROW_NUMBER.
To get around this, I've written the helper below, but if the ability already exists in the codebase (and is probably implemented better there) I'd like to take advantage of it:
fun wrapQueryInRowNumberSubquery(
    stmt: SelectFinalStep<Record>,
    limit: Int = 0,
    offset: Int = 0
): SelectConditionStep<Record> {
    stmt.query.addSelect(
        DSL.rowNumber().over()
            .partitionBy(DSL.field("*")) // custom logic here
            .orderBy(DSL.field("*")) // custom logic here
            .`as`("row_num")
    )
    return DSL.select(DSL.asterisk()).from(stmt)
        .where(
            DSL.field("row_num").greaterThan(
                DSL.inline(offset)
            )
        )
        .and(
            // lessOrEqual so that exactly `limit` rows survive, matching LIMIT/OFFSET semantics
            DSL.field("row_num").lessOrEqual(
                DSL.inline(offset + limit)
            )
        )
}
There are currently (jOOQ 3.17) no flags to enable / disable individual emulations independently of the SQL dialect.

How to get all data from a message segment with Databricks Labs Smolder

I know that you can use the provided helper functions to retrieve subfields from message segments, for example:
val nameDf = df.select(segment_field("PID", 4).alias("name"))
Is there a way I can extract the entire message segment ("PID") instead of having to put in an index for each subfield?
PID|||d40726da-9b7a-49eb-9eeb-e406708bbb60||Heller^Keneth||||||140 Pacocha Way Suite 52^^Northampton^Massachusetts^^USA
segment_field as implemented here is a helper method to extract a single value given the index.
/**
 * Extracts a field from a message segment.
 *
 * @param segment The ID of the segment to extract.
 * @param field The index of the field to extract.
 * @param segmentColumn The name of the column containing message segments.
 *   Defaults to "segments".
 * @return Yields a new column containing the field of a message segment.
 *
 * @note If there are multiple segments with the same ID, this function will
 *   select the field from one of the segments. Order is undefined.
 */
def segment_field(segment: String,
                  field: Int,
                  segmentColumn: Column = col("segments")): Column = {
  filter(segmentColumn, s => s("id") === lit(segment))
    .getItem(0)
    .getField("fields")
    .getItem(field)
}
However, if you are interested in extracting all of the field values regardless of index, you can replicate this behaviour as follows:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def segment_fields(segment: String,
                   segmentColumn: Column = col("segments")): Column = {
  filter(segmentColumn, s => s("id") === lit(segment))
    .getItem(0)
    .getField("fields")
}
and use it like this:
val nameDf = df.select(segment_fields("PID").alias("values"))
You may then extract or transform the data as desired.

Alternative to count in Spark sql to check if a query return empty result

I know the count action can be expensive in Spark, so to improve performance I'd like a different way to just check whether a query returns any results.
Here is what I did:
var df = spark.sql("select * from table_name where condition = 'blah' limit 1");
var dfEmpty = df.head(1).isEmpty;
Is this a valid solution, or is there any potential uncaught error if I use it to check the query result? It is a lot faster, though.
isEmpty takes the head of the data. It is quite reasonable for checking whether the result is empty or not, it is provided by the Spark API, and it is optimized, so I'd prefer it.
Also, I don't think the limit 1 in the query is required.
/**
 * Returns true if the `Dataset` is empty.
 *
 * @group basic
 * @since 2.4.0
 */
def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0) == 0
}
I think this is OK. I guess you could also omit the limit(1), because a limit(1) is already part of the implementation of df.isEmpty. See also How to check if spark dataframe is empty?.
Note that the df.isEmpty solution may not evaluate all columns. For example, if you have a UDF for one column, it will probably not be executed here but could throw exceptions on the real query. df.head(1).isEmpty, on the other hand, evaluates all columns for one row.

NoNodeAvailableException after some insert request to cassandra

I am trying to insert data into a local Cassandra cluster using async execution and version 4 of the driver (the same version as my Cassandra instance).
I have instantiated the cql session in this way:
CqlSession cqlSession = CqlSession.builder()
    .addContactEndPoint(new DefaultEndPoint(
        InetSocketAddress.createUnresolved("localhost", 9042)))
    .build();
Then I create a statement in an async way:
return session.prepareAsync(
        "insert into table (p1,p2,p3, p4) values (?, ?,?, ?)")
    .thenComposeAsync(
        (ps) -> {
            CompletableFuture<AsyncResultSet>[] result = data.stream()
                .map(
                    (d) -> session.executeAsync(
                        ps.bind(d.p1, d.p2, d.p3, d.p4)
                    ).toCompletableFuture()
                )
                .toArray(CompletableFuture[]::new);
            return CompletableFuture.allOf(result);
        }
    );
data is a dynamic list filled with user data.
When I execute the code I get the following exception:
Caused by: com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was available to execute the query
at com.datastax.oss.driver.api.core.AllNodesFailedException.fromErrors(AllNodesFailedException.java:53)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.sendRequest(CqlPrepareHandler.java:210)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.onThrottleReady(CqlPrepareHandler.java:167)
at com.datastax.oss.driver.internal.core.session.throttling.PassThroughRequestThrottler.register(PassThroughRequestThrottler.java:52)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.<init>(CqlPrepareHandler.java:153)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareAsyncProcessor.process(CqlPrepareAsyncProcessor.java:66)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareAsyncProcessor.process(CqlPrepareAsyncProcessor.java:33)
at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:210)
at com.datastax.oss.driver.api.core.cql.AsyncCqlSession.prepareAsync(AsyncCqlSession.java:90)
The node is active and some data is inserted before the exception is raised. I have also tried setting a datacenter name on the session builder, without any result.
Why is this exception raised if the node is up and running? I have only one local node; could that be the problem?
The biggest thing that I don't see is a way to limit the current number of active async requests.
Basically, if that (mapped) data stream gets hit hard enough, it will keep creating new requests that it then awaits. If the writes coming in from those requests create more back-pressure than the node can keep up with, the node becomes overwhelmed and stops accepting requests.
Take a look at this post by Ryan Svihla of DataStax:
Cassandra: Batch Loading Without the Batch — The Nuanced Edition
Its code is from the 3.x version of the driver, but the concepts are the same. Basically, provide some way to throttle down the writes, or limit the number of "in-flight" requests running at any given time, and that should help greatly.
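As a rough illustration of that idea with the 4.x driver, here is a minimal sketch that caps the number of in-flight requests with a plain java.util.concurrent.Semaphore; the permit count is arbitrary and the Data holder simply mirrors the fields used in the question:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class ThrottledInserter {

    // Arbitrary cap: at most 256 requests in flight at any time.
    private final Semaphore inFlight = new Semaphore(256);

    // Mirrors the Data holder from the question (field names assumed).
    public static class Data {
        Object p1, p2, p3, p4;
    }

    public CompletableFuture<Void> insertAll(CqlSession session, PreparedStatement ps, List<Data> data) {
        CompletableFuture<?>[] futures = data.stream()
                .map(d -> {
                    inFlight.acquireUninterruptibly(); // block until a permit frees up (back-pressure)
                    return session.executeAsync(ps.bind(d.p1, d.p2, d.p3, d.p4))
                            .whenComplete((rs, error) -> inFlight.release()) // return the permit either way
                            .toCompletableFuture();
                })
                .toArray(CompletableFuture[]::new);
        return CompletableFuture.allOf(futures);
    }
}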
Finally, I found a solution using BatchStatement and a little custom code to create a chunked list.
int chunks = 0;
if (data.size() % 100 == 0) {
    chunks = data.size() / 100;
} else {
    chunks = (data.size() / 100) + 1;
}
final int finalChunks = chunks;
return session.prepareAsync(
        "insert into table (p1,p2,p3, p4) values (?, ?,?, ?)")
    .thenComposeAsync(
        (ps) -> {
            AtomicInteger counter = new AtomicInteger();
            final List<CompletionStage<AsyncResultSet>> batchInsert = data.stream()
                .map(
                    (d) -> ps.bind(d.p1, d.p2, d.p3, d.p4)
                )
                .collect(Collectors.groupingBy(it -> counter.getAndIncrement() / finalChunks))
                .values().stream()
                .map(
                    boundedStatements -> BatchStatement.newInstance(
                        BatchType.LOGGED,
                        boundedStatements.toArray(new BatchableStatement[0]))
                )
                .map(session::executeAsync)
                .collect(Collectors.toList());
            return CompletableFutures.allSuccessful(batchInsert);
        }
    );

Insert is 10 times faster than Update in Cassandra. Is it normal?

In my Java application accessing Cassandra, it can insert 500 rows per second, but only update 50 rows per second (the updated rows didn't actually exist beforehand).
Updating one hundred fields is as fast as updating one field.
I just use CQL statements in the Java application.
Is this situation normal? How can I improve my application?
public void InsertSome(List<Data> data) {
    String insertQuery = "INSERT INTO Data (E,D,A,S,C,......) values(?,?,?,?,?,.............); ";
    if (prepared == null)
        prepared = getSession().prepare(insertQuery);
    count += data.size();
    for (int i = 0; i < data.size(); i++) {
        List<Object> objs = getFiledValues(data.get(i));
        BoundStatement bs = prepared.bind(objs.toArray());
        getSession().execute(bs);
    }
}

public void UpdateOneField(Data data) {
    String updateQuery = "UPDATE Data set C=? where E=? and D=? and A=? and S=?; ";
    if (prepared == null)
        prepared = getSession().prepare(updateQuery);
    BoundStatement bs = prepared.bind(data.getC(), data.getE(),
            data.getD(), data.getA(), data.getS());
    getSession().execute(bs);
}

public void UpdateOne(Data data) {
    String updateQuery = "UPDATE Data set C=?,U=?,F........where E=? and D=? and A=? and S=? and D=?; ";
    if (prepared == null)
        prepared = getSession().prepare(updateQuery);
    ......
    BoundStatement bs = prepared.bind(objs2.toArray());
    getSession().execute(bs);
}
Schema:
Create Table Data (
    E,
    D,
    A,
    S,
    D,
    C,
    U,
    S,
    ...
    PRIMARY KEY ((E, D), A, S)
) WITH compression = { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb' : 64 }
  AND compaction = { 'class' : 'LeveledCompactionStrategy' };
Another scenario:
I used the same application to access another Cassandra cluster, and the result was different: UPDATE was as fast as INSERT, but it could only INSERT/UPDATE 5 rows per second. That Cassandra cluster is DataStax Enterprise running on GCE (I used the default DataStax Enterprise on Google Cloud Launcher).
So I think some configuration is probably the reason, but I don't know which.
Conceptually, UPDATE and INSERT are the same, so I would expect similar performance. UPDATE doesn't check whether the data already exists (unless you are doing a lightweight transaction with IF EXISTS).
I noticed that each of your methods prepares the statement if it is null. Is it possible the statement is being re-prepared each time? That would add a round trip for every method invocation. I also noticed that InsertSome does multiple inserts per invocation, whereas UpdateOne / UpdateOneField execute one statement each. So if the statement were prepared every time, that's one prepare per update, whereas it's only done once per insert for a whole list.
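If re-preparation is indeed what's happening, the usual fix is to prepare each statement exactly once and reuse it for every bind. A minimal sketch against the 3.x-style driver API the question appears to use; the DAO class is illustrative and Data with its getters is taken from the question's snippets:

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class DataDao {

    private final Session session;
    private final PreparedStatement updateOneField;

    public DataDao(Session session) {
        this.session = session;
        // Prepared exactly once, when the DAO is created, then reused for every call.
        this.updateOneField = session.prepare(
                "UPDATE Data set C=? where E=? and D=? and A=? and S=?;");
    }

    // Data is the question's entity class with getC()/getE()/getD()/getA()/getS().
    public void updateOneField(Data data) {
        BoundStatement bs = updateOneField.bind(
                data.getC(), data.getE(), data.getD(), data.getA(), data.getS());
        session.execute(bs);
    }
}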
Cassandra uses log-structured merge trees for its on-disk format, meaning all writes are done sequentially (the database is an append-only log). That implies lower write latency.
At the cluster level, Cassandra also achieves greater write scalability by partitioning the key space so that each machine is only responsible for a portion of the keys. That implies higher write throughput, as more writes can be done in parallel.
