I just updated to Spring Boot 2.7.2 and the new H2 2.1.214.
The jOOQ version is 3.16.6 (pro).
Since then, I get a bad grammar SQL exception for a query with a LIMIT clause.
If I understand it correctly, the LIMIT keyword is no longer supported in H2; FETCH FIRST should be used instead.
dslContext.select( FOO.fields() ).from( FOO ).limit( 1 )
select "FOO".ID" from "FOO" limit ?
Stacktrace:
Caused by: org.h2.jdbc.JdbcSQLSyntaxErrorException: Syntax error in SQL statement "select ""FOO"".""ID"" from ""FOO"" limit [*]?"; SQL statement:
select "FOO"."ID" from "FOO" limit ? [42000-214]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:502)
at org.h2.message.DbException.getJdbcSQLException(DbException.java:477)
at org.h2.message.DbException.get(DbException.java:223)
at org.h2.message.DbException.get(DbException.java:199)
at org.h2.message.DbException.getSyntaxError(DbException.java:247)
at org.h2.command.Parser.getSyntaxError(Parser.java:898)
at org.h2.command.Parser.prepareCommand(Parser.java:572)
at org.h2.engine.SessionLocal.prepareLocal(SessionLocal.java:631)
at org.h2.engine.SessionLocal.prepareCommand(SessionLocal.java:554)
at org.h2.jdbc.JdbcConnection.prepareCommand(JdbcConnection.java:1116)
at org.h2.jdbc.JdbcPreparedStatement.<init>(JdbcPreparedStatement.java:92)
at org.h2.jdbc.JdbcConnection.prepareStatement(JdbcConnection.java:288)
at com.zaxxer.hikari.pool.ProxyConnection.prepareStatement(ProxyConnection.java:337)
at com.zaxxer.hikari.pool.HikariProxyConnection.prepareStatement(HikariProxyConnection.java)
at jdk.internal.reflect.GeneratedMethodAccessor287.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at net.ttddyy.dsproxy.proxy.ConnectionProxyLogic.performQueryExecutionListener(ConnectionProxyLogic.java:112)
at net.ttddyy.dsproxy.proxy.ConnectionProxyLogic.access$000(ConnectionProxyLogic.java:25)
at net.ttddyy.dsproxy.proxy.ConnectionProxyLogic$1.execute(ConnectionProxyLogic.java:50)
at net.ttddyy.dsproxy.listener.MethodExecutionListenerUtils.invoke(MethodExecutionListenerUtils.java:42)
at net.ttddyy.dsproxy.proxy.ConnectionProxyLogic.invoke(ConnectionProxyLogic.java:47)
at net.ttddyy.dsproxy.proxy.jdk.ConnectionInvocationHandler.invoke(ConnectionInvocationHandler.java:25)
at jdk.proxy2/jdk.proxy2.$Proxy140.prepareStatement(Unknown Source)
at jdk.internal.reflect.GeneratedMethodAccessor287.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at net.ttddyy.dsproxy.proxy.ConnectionProxyLogic.performQueryExecutionListener(ConnectionProxyLogic.java:112)
at net.ttddyy.dsproxy.proxy.ConnectionProxyLogic.access$000(ConnectionProxyLogic.java:25)
at net.ttddyy.dsproxy.proxy.ConnectionProxyLogic$1.execute(ConnectionProxyLogic.java:50)
at net.ttddyy.dsproxy.listener.MethodExecutionListenerUtils.invoke(MethodExecutionListenerUtils.java:42)
at net.ttddyy.dsproxy.proxy.ConnectionProxyLogic.invoke(ConnectionProxyLogic.java:47)
at net.ttddyy.dsproxy.proxy.jdk.ConnectionInvocationHandler.invoke(ConnectionInvocationHandler.java:25)
at jdk.proxy2/jdk.proxy2.$Proxy140.prepareStatement(Unknown Source)
at org.quickperf.sql.connection.QuickPerfDatabaseConnection.prepareStatement(QuickPerfDatabaseConnection.java:62)
at jdk.internal.reflect.GeneratedMethodAccessor287.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.springframework.jdbc.datasource.TransactionAwareDataSourceProxy$TransactionAwareInvocationHandler.invoke(TransactionAwareDataSourceProxy.java:238)
at jdk.proxy2/jdk.proxy2.$Proxy320.prepareStatement(Unknown Source)
at org.jooq.impl.ProviderEnabledConnection.prepareStatement(ProviderEnabledConnection.java:109)
at org.jooq.impl.SettingsEnabledConnection.prepareStatement(SettingsEnabledConnection.java:82)
at org.jooq.impl.AbstractResultQuery.prepare(AbstractResultQuery.java:210)
at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:307)
... 92 more
Is this a known issue?
Is there a way to rewrite the query?
This seems to be a frequent misconception about how to use jOOQ with H2 as a test database, so I've written a blog post about jOOQ's stance on this. The feature was also requested on the jOOQ issue tracker as #13895.
TL;DR: With jOOQ, please don't use H2's compatibility modes.
Valid configurations
There is really only one valid configuration when using H2 as a test database for a different production database product:
Use jOOQ's SQLDialect.H2 and native H2, without a compatibility mode. This is integration tested by jOOQ (see the sketch below).
You might be tempted to:
Use jOOQ's SQLDialect.SQLSERVER and H2 with the SQL Server compatibility mode. This is not integration tested by jOOQ (see details below). I don't recommend doing this, because you're likely to run into a limitation of H2's compatibility mode in a feature that jOOQ assumes is fully there, because jOOQ thinks it's talking to an actual SQL Server instance.
Use SQLDialect.H2 with H2 in compatibility mode. This doesn't really make sense either, because, well, LIMIT is valid in H2, but not in SQL Server, which supports only TOP or OFFSET .. FETCH. So you're not going to get a good SQL dialect match. See e.g.: https://github.com/h2database/h2database/issues/3537
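To make the supported option concrete, here's a minimal sketch (the connection URL and table name are just placeholders): plain H2 without any MODE=... compatibility flag in the URL, together with jOOQ's native H2 dialect.

import org.jooq.DSLContext;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;

import java.sql.Connection;
import java.sql.DriverManager;

public class H2NativeSketch {
    public static void main(String[] args) throws Exception {
        // Plain H2: note there is no MODE=MSSQLServer (or similar) flag in the URL.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:test;DB_CLOSE_DELAY=-1", "sa", "")) {
            DSLContext ctx = DSL.using(conn, SQLDialect.H2);

            ctx.execute("create table foo (id int)");

            // jOOQ renders whatever row-limiting syntax the native H2 dialect expects.
            System.out.println(ctx.select().from(DSL.table("foo")).limit(1).fetch());
        }
    }
}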
Some background on compatibility modes and using H2 as a test DB
The assumption in jOOQ is that when you use H2, you use H2 natively (e.g. as an in-memory database, also in production). Using H2 as a simple test database isn't the primary use case for H2, even if that has seen a lot of popularity in recent years. But why not just use a testcontainers-based approach instead, and develop / integration test only against your production RDBMS? The benefits are obvious:
You don't have any of these infrastructure problems and test artifacts
You get to use all the vendor-specific features, such as table-valued functions, table-valued parameters, XML/JSON, etc.
The H2 compatibility modes exist to make sure your native SQL Server queries work on H2 without any changes. This is useful for purely JDBC-based applications. But since you're using jOOQ and SQLDialect.H2, why keep a compatibility mode around in H2? jOOQ already handles the translation between dialects, if you want to continue using H2 as a test database product. But again, I think your life will be simpler if you use testcontainers. You can even use it to generate jOOQ code as shown here.
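For example, a rough sketch of the testcontainers approach (the PostgreSQL image is just an illustrative product, there are analogous modules for other vendors, and the table name is a placeholder; it assumes the testcontainers postgresql module is on the classpath):

import org.jooq.DSLContext;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;
import org.testcontainers.containers.PostgreSQLContainer;

import java.sql.Connection;
import java.sql.DriverManager;

public class TestcontainersSketch {
    public static void main(String[] args) throws Exception {
        // Spin up a throwaway instance of a real database product instead of an H2 stand-in.
        try (PostgreSQLContainer<?> db = new PostgreSQLContainer<>("postgres:14")) {
            db.start();

            try (Connection conn = DriverManager.getConnection(db.getJdbcUrl(), db.getUsername(), db.getPassword())) {
                DSLContext ctx = DSL.using(conn, SQLDialect.POSTGRES);

                // Same queries, same dialect as production.
                ctx.execute("create table foo (id int)");
                System.out.println(ctx.select().from(DSL.table("foo")).limit(1).fetch());
            }
        }
    }
}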
Related
When trying to have Spark (3.1.1) write partitioned data to an S3 bucket using the S3A committers, I am getting an error:
Caused by: java.lang.IllegalStateException: Cannot parse URI s3a://partition-spaces-test-bucket/test_spark_partitioning_s3a_committers/City=New York/part-00000-7d95735c-ecc4-4263-86fe-51263b45bbf2-73dcb7a0-7da5-4f45-a12f-e57face31212.c000.snappy.parquet
at org.apache.hadoop.fs.s3a.commit.files.SinglePendingCommit.destinationPath(SinglePendingCommit.java:255)
at org.apache.hadoop.fs.s3a.commit.files.SinglePendingCommit.validate(SinglePendingCommit.java:195)
at org.apache.hadoop.fs.s3a.commit.files.PendingSet.validate(PendingSet.java:146)
at org.apache.hadoop.fs.s3a.commit.files.PendingSet.load(PendingSet.java:109)
at org.apache.hadoop.fs.s3a.commit.AbstractS3ACommitter.lambda$loadPendingsetFiles$4(AbstractS3ACommitter.java:478)
at org.apache.hadoop.fs.s3a.commit.Tasks$Builder$1.run(Tasks.java:254)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
This is caused by the space in the partition column value I am using.
When using the default Spark FileOutputCommitter this works, and Spark creates the directory with the space in its name.
The S3A committer uses a java.net.URI object to create the org.apache.hadoop.fs.Path object, and it is the URI that throws the URISyntaxException because of this space.
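A small standalone snippet (the bucket and file name are made up) shows the difference between the two construction paths:

import org.apache.hadoop.fs.Path;
import java.net.URI;

public class SpaceInPathDemo {
    public static void main(String[] args) throws Exception {
        String dest = "s3a://some-bucket/some_table/City=New York/part-00000.snappy.parquet";

        // Path(String) escapes the space internally, so this works.
        System.out.println(new Path(dest));

        // A raw java.net.URI does not accept an unescaped space and throws URISyntaxException here.
        URI uri = new URI(dest);
        System.out.println(new Path(uri));
    }
}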
My question is: why did the S3A committer developers choose to use a URI for the path instead of creating the Path directly from the string, as is done in the FileOutputCommitter? Is there a good reason to do so?
And how can I overcome this, assuming I don't want to change the values of this column by replacing the space with another character such as an underscore?
My question is: why did the S3A committer developers choose to use a URI for the path instead of creating the Path directly from the string, as is done in the FileOutputCommitter? Is there a good reason to do so?
That is a good question. We do it that way because we have to marshal the list of paths from the workers to the job committer, which we do in JSON files. And that marshalling didn't round-trip properly with spaces. This has been found and fixed in HADOOP-17112, "whitespace not allowed in paths when saving files to s3a via committers".
One interesting question is: why didn't anybody notice? And that is because nobody else uses spaces in the partitioning. Not in any of the tests, TPC-DS benchmarks, etc. One of those little assumptions which we developers had that turns out to not always hold. As well as fixing the issue, we now make sure our tests do have paths with spaces in them to stop it from ever coming back.
And how can I overcome this, assuming I don't want to change the values of this column by replacing the space with another character such as an underscore?
Upgrade to the hadoop-3.3.1 binaries.
Note that all this code is open source; the fix for the bug was actually provided by the same person who identified the problem. While you are free to criticize the authors of the feature, we depend on reviews and testing from others. If we don't get that, we can't guarantee our code meets the more obscure needs of some people. In particular, for the object stores, the configuration space of all the S3 store options (region, replication, IAM restrictions, encryption) and client connectivity options (proxy, AWS access point, ...) means that it is really hard to get full coverage of the possible configurations. Anything you can do to help qualify releases, or even better, check out, build, and test the modules before we enter the release phase, is always appreciated. It's the only way you can be confident that things are going to work in your world.
While I suppose I could fork Cassandra and modify it to taste, is there an easier way to intercept CQL and reject it?
The whys and wherefores are too long to go into, but for several reasons I would like to enforce certain requirements on the CQL that gets run, starting with performance.
To get started with a simple example, what'd be the easiest way to:
reject CQL with ALLOW FILTERING
reject a CQL SELECT without a WHERE clause
The key point is enforcement ... in or near the server, rather than relying on programmers to exclude these cases.
TIA
Cassandra doesn't have such a built-in capability right now. There is movement in that direction in DSE 6.8, where guardrails exist, but it's not open source.
But you can at least monitor queries via the audit functionality of Cassandra 4.0 (not released yet), or via Ericsson's ecaudit plugin for 2.2/3.0/3.11.
I am testing the UDF / UDA feature in Cassandra, and it seems good. But I have a few questions about using it.
1) In cassandra.yaml, it is mentioned that sandboxing is enabled to avoid malicious code. So are we violating that rule, and what are the consequences of enabling this support (flag)?
2) What are the advantages of using UDF / UDA in Cassandra compared to reading the data and writing the aggregation logic on the client side?
3) Also, apart from Java, is there language support for Node.js or Python for writing UDF / UDA?
Thanks,
Harry
Here are some comments:
Sandboxing prevents execution of "dangerous" code: working with files/sockets, starting threads, etc. This blog post provides some additional details about it.
There could be several: you don't move data from the coordinator node to your app, you offload calculations to the Cassandra cluster, etc.
Languages supporting JSR 223 "Scripting for Java": JavaScript, Groovy, JRuby, Jython, ... (with enable_scripted_user_defined_functions set to true in the Cassandra config). But Java should be the fastest.
Also look at this presentation about UDF/UDA from the author of this functionality (Robert Stupp), and this blog post with more details and examples.
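For illustration, here is roughly what a minimal UDF/UDA pair could look like, created here through the DataStax Java driver 3.x (the contact point, keyspace, and function/aggregate names are placeholders; the UDF body itself is plain Java that runs server-side):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class UdaSketch {
    public static void main(String[] args) {
        // Requires enable_user_defined_functions: true in cassandra.yaml.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            // State function: counts rows and sums the values it sees. The body is plain Java.
            session.execute(
                "CREATE OR REPLACE FUNCTION avg_state(state tuple<int, bigint>, val int) " +
                "CALLED ON NULL INPUT RETURNS tuple<int, bigint> LANGUAGE java AS " +
                "'if (val != null) { state.setInt(0, state.getInt(0) + 1); " +
                "state.setLong(1, state.getLong(1) + val.intValue()); } return state;'");

            // Aggregate built on the state function (no FINALFUNC, so it returns the raw tuple).
            session.execute(
                "CREATE OR REPLACE AGGREGATE my_average(int) " +
                "SFUNC avg_state STYPE tuple<int, bigint> INITCOND (0, 0)");

            // Usage: SELECT my_average(some_int_column) FROM some_table;
        }
    }
}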
I'm evaluating the spark-cassandra-connector and I'm struggling to get a range query on a partition key to work.
According to the connector's documentation, it seems possible to do server-side filtering on the partition key using the equality or IN operator, but unfortunately my partition key is a timestamp, so I cannot use those.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also, I can assure you that there is data to be returned, since running the query in cqlsh (doing the appropriate conversion using the 'token' function) DOES return data.
I'm using spark 1.1.0 with standalone mode. Cassandra is 2.1.2 and connector version is 'b1.1' branch. Cassandra driver is DataStax 'master' branch.
Cassandra cluster is overlaid on spark cluster with 3 servers with replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra, and that you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over that CassandraRDD.
So your code (in Scala) should look something like this:
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
val (start, end) = (fmt.parse("2013-01-01T00:00:00.000Z"), fmt.parse("2013-12-31T00:00:00.000Z"))
val cassRDD = sc.cassandraTable("keyspace name", "table name")
  .filter(row => !row.getDate("timestamp").before(start) && row.getDate("timestamp").before(end))
If you are interested in making this type of query, you might have to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small to medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the direct Cassandra driver, which is not integrated with Spark, or you can have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
But, as I said before, I don't know if it fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option you have, though the least efficient, is to bring the full data set to your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!
As I understand it, GridGain 6 has some customized serialization and also utilizes H2 for various purposes.
We use H2 as a serialized object store. For instance, here is the relevant part of a table schema.
CREATE TABLE IF NOT EXISTS QUEUE (ID IDENTITY PRIMARY KEY, OBJECT OTHER NOT NULL ....)
When attempting to insert a row, I get the following error. The last few lines indicate that the GridH2IndexingSpi is configured and is failing on something (even though my test isn't running on the Grid). I couldn't easily debug further, since the SPI source and my debugger seem out of sync and the line numbers are meaningless.
From what I was able to debug in Utils.java, it appears that the GridGain serializer has been configured in H2 (statically!) and is being used.
Any thoughts on how to resolve or avoid this situation? I've tried various H2 versions such as 1.3.176 (which GridGain uses) and the newer 1.4.177, but as expected, they don't make any difference, since the issue is with the use of the indexing SPI.
I can try to create a small H2 / GridGain project to illustrate the issue, if that would help.
Thanks
Exception in thread "pool-4-thread-1" org.springframework.jdbc.UncategorizedSQLException: PreparedStatementCallback; uncategorized SQLException for SQL []; SQL state [90026]; error code [90026]; Serialization failed, cause: "java.lang.NullPointerException"; SQL statement:
INSERT INTO QUEUE (OBJECT....) VALUES (?,?,?,?) [90026-170]; nested exception is org.h2.jdbc.JdbcSQLException: Serialization failed, cause: "java.lang.NullPointerException"; SQL statement:
INSERT INTO QUEUE (OBJECT, ....) VALUES (?,?,?,?) [90026-170]
....
Caused by: org.h2.jdbc.JdbcSQLException: Serialization failed, cause: "java.lang.NullPointerException"; SQL statement:
INSERT INTO QUEUE (OBJECT, ....) VALUES (?,?,?,?) [90026-170]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:329)
at org.h2.message.DbException.get(DbException.java:158)
....
Caused by: java.lang.NullPointerException
at org.gridgain.grid.spi.indexing.h2.GridH2IndexingSpi.access$100(GridH2IndexingSpi.java:145)
at org.gridgain.grid.spi.indexing.h2.GridH2IndexingSpi$1.serialize(GridH2IndexingSpi.java:201)
at org.h2.util.Utils.serialize(Utils.java:273)
... 27 more
I finally understood what's happening. At the point when the GridGain H2 integration was implemented, H2 had only one static serializer. That's why GridGain is using a static property. As a possible workaround for the problem, can you try setting a custom serializer on your H2 database which will in fact use the usual H2 serialization?
The latest H2 version supports specifying a per-database serializer, and GridGain will fix this in an upcoming version.
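As a sketch of that workaround (assuming an H2 version that has the org.h2.api.JavaObjectSerializer interface and the SET JAVA_OBJECT_SERIALIZER statement; please double-check both against the docs of your H2 version), a serializer that simply falls back to standard JDK serialization could look like this:

import org.h2.api.JavaObjectSerializer;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Plain JDK serialization, bypassing whatever serializer was registered statically.
public class PlainJavaSerializer implements JavaObjectSerializer {

    @Override
    public byte[] serialize(Object obj) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    @Override
    public Object deserialize(byte[] data) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }
}

It would then be registered per database, e.g. right after opening the connection, with something like SET JAVA_OBJECT_SERIALIZER 'com.example.PlainJavaSerializer' (the class name is a placeholder).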
I am not sure what it is you are doing with H2, but you should not use the same H2 database that GridGain uses internally.
GridGain utilizes H2 internally exclusively for SQL indexing and querying capabilities.