Oracle dialect for pagination

We are retrieving data from Oracle using Spring Data JDBC, with org.springframework.data.relational.core.dialect.OracleDialect as the dialect.
This works as expected for a repository that extends CrudRepository.
But when we change the repository to extend PagingAndSortingRepository so that we can retrieve data by page number, we get an exception.
Based on our analysis, the queries generated by LIMIT_CLAUSE and LOCK_CLAUSE do not conform to Oracle's syntax.
Is there an Oracle dialect, implementing org.springframework.data.relational.core.dialect.Dialect, that generates a proper limit query?

The only available OracleDialect is based on AnsiDialect, and Oracle 12c is supposed to support the ANSI standard.
Further investigation leads to the suspicion that the ANSI standard allows multiple variants, and AnsiDialect produces a clause that does not work with Oracle 12c, although it is accepted by the Oracle XE 18 instance used for testing.
Spring Data JDBC currently creates clauses of the form OFFSET %d ROWS FETCH FIRST %d ROWS ONLY, which according to https://dba.stackexchange.com/questions/30452/ansi-iso-plans-for-limit-standardization conforms to the standard.
But https://stackoverflow.com/a/24046664/66686 hints that Oracle 12c might require OFFSET %d ROWS FETCH NEXT %d ROWS ONLY.
As a workaround you can register a custom dialect as described in https://spring.io/blog/2020/05/20/migrating-to-spring-data-jdbc-2-0#dialects.
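A minimal sketch of such a dialect, assuming the FETCH NEXT variant is indeed what Oracle 12c expects (the class name FixedOracleDialect is made up for illustration): it reuses AnsiDialect and only swaps out the limit clause:
import org.springframework.data.relational.core.dialect.AnsiDialect;
import org.springframework.data.relational.core.dialect.LimitClause;

public class FixedOracleDialect extends AnsiDialect {

    public static final FixedOracleDialect INSTANCE = new FixedOracleDialect();

    private static final LimitClause LIMIT_CLAUSE = new LimitClause() {

        @Override
        public String getLimit(long limit) {
            // FETCH NEXT instead of the FETCH FIRST produced by AnsiDialect
            return String.format("FETCH NEXT %d ROWS ONLY", limit);
        }

        @Override
        public String getOffset(long offset) {
            return String.format("OFFSET %d ROWS", offset);
        }

        @Override
        public String getLimitOffset(long limit, long offset) {
            return String.format("OFFSET %d ROWS FETCH NEXT %d ROWS ONLY", offset, limit);
        }

        @Override
        public LimitClause.Position getClausePosition() {
            return LimitClause.Position.AFTER_ORDER_BY;
        }
    };

    @Override
    public LimitClause limit() {
        return LIMIT_CLAUSE;
    }
}
The dialect then has to be made known to Spring Data JDBC through the dialect registration mechanism described in the blog post linked above.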

Related

Is Presto a data store for storing data?

I am new to Presto and have a few doubts about it:
1. Is Presto a data store (database) or a query engine?
2. If it is a query engine, is there a common query syntax for accessing Hive, SQL, or Cassandra data through the connectors, or does each connector accept the query syntax of its own data source?
3. Does query execution take place in Presto or at the connected data source?
1. It is a query engine. However, it accesses data from many different data sources.
2. Yes, the common syntax is ANSI SQL. When accessing data in an underlying data source, that source's specific interface is used (Thrift, HDFS, JDBC, etc.), but this is hidden from the user.
3. In both places. Presto can push some filtering down to the underlying data source (projections, WHERE clauses), and there is an ongoing effort to push down larger parts of the SQL query (see https://github.com/prestosql/presto/issues/18). The rest is evaluated in Presto.
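To illustrate the common-syntax point, here is a hedged sketch of a single ANSI SQL query spanning two connectors, issued through the Presto JDBC driver; the coordinator URL and the catalog, schema, and table names are invented for illustration, and the driver jar is assumed to be on the classpath:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoFederatedQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical coordinator; Presto needs a user name, no password for unsecured setups.
        String url = "jdbc:presto://coordinator.example.com:8080";
        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             Statement stmt = conn.createStatement();
             // One ANSI SQL statement joining a Hive table with a Cassandra table.
             ResultSet rs = stmt.executeQuery(
                     "SELECT u.name, count(*) AS visits "
                   + "FROM hive.web.page_views v "
                   + "JOIN cassandra.crm.users u ON v.user_id = u.id "
                   + "GROUP BY u.name")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " -> " + rs.getLong("visits"));
            }
        }
    }
}
Presto plans the query, reads each side through the corresponding connector, and evaluates the join and aggregation itself, which matches the push-down behaviour described above.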

How to create Accumulo Embedded Index with Rounds strategy?

I am a beginner with Accumulo and am using Accumulo 1.7.2.
As an indexing strategy, I am planning to use the Embedded Index with Rounds strategy (http://accumulosummit.com/program/talks/accumulo-table-designs/ on page 21). I couldn't find any documentation on it anywhere, and I am wondering if any of you could help me here.
My description of that strategy was mostly just about avoiding sending a query to all the servers at once, by querying one portion of the table at a time. Adding rounds to an existing 'embedded index' example might be the easiest place to start.
The Accumulo O'Reilly book includes an example that starts on page 284 in a section called 'Index Partitioned by Document' whose code lives here: https://github.com/accumulobook/examples/tree/master/src/main/java/com/accumulobook/designs/multitermindex
The query portion of that example is in the class WikipediaQueryMultiterm.java. It uses a BatchScanner configured with a single empty range to send the query to all tablet servers. To implement the by-rounds query strategy this could be replaced with something that goes from one tablet server to the next, either in a round-robin fashion, or perhaps going to 1, then if not enough results are found, going to the next 2, then 4 and so on, to mimic what Cassandra does.
Since you can't target servers directly with a query, and since the table uses partition IDs, you could configure your scanners to read all the key-value pairs within the first partition ID, then query the next partition ID, and so on, or perhaps visit the partitions in random order to avoid congestion.
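A minimal sketch of that by-rounds loop, assuming an embedded-index table whose row is the partition ID as in the book example; the table name, partition list, and "enough results" threshold below are made up for illustration:
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class ByRoundsQuery {

    // Query one partition per round and stop as soon as enough results are found,
    // instead of hitting every tablet server at once with a BatchScanner.
    public static int queryInRounds(Connector conn, String table,
                                    List<String> partitionIds, int enough) throws Exception {
        int found = 0;
        for (String partition : partitionIds) {
            Scanner scanner = conn.createScanner(table, Authorizations.EMPTY);
            scanner.setRange(new Range(partition)); // all key-value pairs in this partition's row
            for (Entry<Key, Value> entry : scanner) {
                // evaluate the embedded-index entry and collect matching documents here ...
                found++;
            }
            if (found >= enough) {
                break; // stop early; remaining partitions are never touched
            }
        }
        return found;
    }
}
The round-robin or 1-then-2-then-4 expansion mentioned above would replace this one-partition-per-round loop with a growing batch of partitions per round.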
What some others have mentioned, adding additional indexes to help narrow the search space before sending a query to multiple servers hosting an embedded index, is beyond the scope of what I described and is a strategy that I believe is employed by the recently released DataWave project: https://github.com/NationalSecurityAgency/datawave

Can Cassandra statements inside a batch have separate timestamps using cpp driver?

If I use a Cassandra batch statement using CQL, then each statement can have an individual timestamp. For example, something like:
BEGIN BATCH
INSERT INTO users (name, surname) VALUES ('Bob', 'Smith') USING TIMESTAMP 10000001;
DELETE FROM users USING TIMESTAMP 10000000 WHERE user='Bob';
APPLY BATCH;
If I try to do something similar using the C++ driver, I'd do something like this:
Create the batch with cass_batch_new
Create the statements with cass_future_get_prepared then cass_prepared_bind
Set the timestamp on each statement with cass_statement_set_timestamp
Add the statement to the batch using cass_batch_add_statement
Execute the batch using cass_session_execute_batch
I'd then expect this to behave in the same way as the CQL batch statement, in that each statement in the batch is executed with its own separate timestamp. But, based on my testing, I've not been able to get this to work. It appears to execute the entire batch using a single timestamp.
Similarly, if I create a monotonic timestamp generator to generate the timestamps for me, it appears to apply a single timestamp to the batch and not to the individual statements.
I've taken a look at the source code for the C++ driver and it looks like when it encodes the statements in the batch for sending to the database (in ExecuteRequest::encode_batch), it doesn't attempt to encode a timestamp for each statement in the batch, just for the batch overall. When encoding individual statements not in a batch it does encode the timestamp for the statement (in ExecuteRequest::internal_encode).
As a workaround, instead of setting the timestamp on the statements using cass_statement_set_timestamp, I can put the "USING TIMESTAMP 10000001" directly into the CQL string, and that then works as intended. So, it appears that the database can correctly have separate timestamps on each statement in the batch, but the C++ driver can't send them.
But if I put the timestamp directly into the CQL with "USING TIMESTAMP 10000001", I can't reuse the statement by just binding new values to it; I'd need to prepare the statement again.
Has anyone else tried this and managed to get it to work? Or is it just a known limitation of the C++ driver?
I'm using Cassandra C++ driver version 2.2.2 and database version 2.2.5, which as far as I can tell uses native protocol version 4.
I also raised this on the Cassandra C++ driver mailing list (Google group), and Michael Penick replied to say it is not currently possible: the underlying protocol does not support a per-statement timestamp within a batch, so the driver is not able to send one.
Native Protocol v4 spec

Spark Cassandra connector - Range query on partition key

I'm evaluating spark-cassandra-connector and I'm struggling to get a range query on the partition key to work.
According to the connector's documentation, server-side filtering on the partition key is possible using the equality or IN operator, but unfortunately my partition key is a timestamp, so I cannot use those.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also, I can confirm that there is data to be returned, since running the query in cqlsh (doing the appropriate conversion using the 'token' function) DOES return data.
I'm using spark 1.1.0 with standalone mode. Cassandra is 2.1.2 and connector version is 'b1.1' branch. Cassandra driver is DataStax 'master' branch.
Cassandra cluster is overlaid on spark cluster with 3 servers with replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra, and that you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over that CassandraRDD.
So your code (in Scala) should look something like this:
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX") // needs: import com.datastax.spark.connector._ ; "X" parses the trailing "Z" as UTC
val (start, end) = (fmt.parse("2013-01-01T00:00:00.000Z"), fmt.parse("2013-12-31T00:00:00.000Z"))
val cassRDD = sc.cassandraTable("keyspace_name", "table_name").filter(row => !row.getDate("timestamp").before(start) && row.getDate("timestamp").before(end))
If you are interested in making this type of query, you may have to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take advantage of the Lucene indexes in Cassandra through it. I would recommend Lucene indexes when you are executing a restricted query that retrieves a small-to-medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the Cassandra driver directly, which is not integrated with Spark, or have a look at the deep-spark project, where a new feature allowing this is about to be released. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
but, as I said before, I don't know whether it fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option, but the least efficient one, is to bring the full data set to your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!

ActivePivot QueriesService.retrieveObject on a distributed cube

I've been trying to create a new action in ActivePivot Live, that calls retrieveObject on the QueriesService. Something like this:
IQueriesService queriesService = getSessionService(IQueriesService.class);
ObjectDTO dto = queriesService.retrieveObject("MyDistributedCube", action.getObjectKey());
This works fine on a local cube, but in a distributed setup it fails to retrieve an object from a remote server. Maybe this is not surprising, but the question is: how do I make it work?
Would a new query type, similar to the LargeDealsQuery in this example help me?
http://support.quartetfs.com/confluence/display/AP4/How+to+Implement+a+Custom+Web+Service
UPDATE:
Here is the context. I have too many fields to reasonably show in the drill-through blotter, so I'm hiding some of them in the cube's drill-through config, both for display reasons and to reduce the amount of data transferred. To see all the fields when that is needed, I added a "drill-through details" item to the right-click menu, which queries the cube for all fields of a single drill-through row and shows them in a pop-up. Maybe there is a better way to get this functionality?
IQueriesService.retrieveObject() is an obsolete service that was introduced in ActivePivot 3.x. At that time ActivePivot stored the input objects directly in memory, and it was natural to provide means to retrieve those objects. But later versions of ActivePivot introduced a column store: the data is extracted from the input objects, then packed and compressed into columnar structures. The input objects are then released, vastly reducing memory usage.
For ActivePivot 4.x the retrieveObject() service has been somewhat maintained, although indirectly, as in fact generic objects are reconstructed on the fly from the compressed data. As you noticed the implementation only supports local cubes. Only MDX queries and Drillthrough queries have a distributed implementation.
For ActivePivot 5.x the retrieveObject() service has been removed completely, in favor of direct access to the underlying datastore.
There is a good chance you can address your use case with a (distributed) drillthrough query that retrieves raw facts. Another quick fix would be to issue your request manually on each of the local cubes in the cluster.
More generally, drillthrough queries (and also MDX queries and GetAggregates queries) are contextual in ActivePivot. You can attach IContextValue instances to the query that will alter the way the query is executed. For drillthrough queries in particular, you can attach the IDrillthroughProperties context value to the query:
public interface IDrillthroughProperties extends IContextValue {

    /**
     * @return The list of hidden columns defined by their header names.
     */
    List<String> getHiddenColumns();

    /**
     * @return The comparator used to sort drillthrough headers (impacts the column order).
     */
    IComparator<String> getHeadersComparator();

    /**
     * Returns the post-processed columns defined as plugin definitions of {@link IPostProcessedProperty}.
     * @return the post-processed columns defined as plugin definitions of {@link IPostProcessedProperty}.
     */
    List<IPluginDefinition> getPostProcessedColumns();

    @Override
    IDrillthroughProperties clone();
}
This will among other things allow you to retrieve only the columns you want for a specific drillthrough query.
According to your update, one may do the following:
Set the drillthrough properties not in the shared context but per role or per user, and allow each user to change them before firing a drillthrough query.
So you have to code a service that exposes all the attributes a user can access; the user chooses which fields should appear in the drillthrough, the service populates the IDrillthroughProperties accordingly, and then fires the drillthrough query. You will then see only what you are interested in.
Think of it like the currency context of the sandbox, but here it impacts the drillthrough.
