How to use pushdown predicates in a Presto connector

I would like to extend Presto's example-http connector to add the relevant predicates to the REST API call made by the connector, in order to minimize data transfer.
I have now spent quite a few hours looking through the documentation, online posts and the Presto source code, but I can't figure out which calls are involved or how to go about this.
Predicate pushdown is mentioned in so many places, but I cannot find any simple code samples or internal descriptions. I'm sure I'm missing something obvious here.

After downloading the source code and running it in a debugger, I found this to be rather straightforward on one hand, while restrictive on the other.
The straightforward part is that we simply have to override isPushdownFilterSupported and pushdownFilter when implementing the ConnectorMetadata interface.
The restrictive part is that the query planner then believes the connector can deal with any type and combination of table filters. In my case, I only want to handle the filters that the remote API I'm calling supports, and have Presto take care of the rest.
It appears the Presto team is fully aware of this, as a) the methods are annotated @Experimental and b) their Javadoc notes: "This interface can be replaced with a connector optimizer rule once the engine supports these (#12546)." That would clearly be the right approach for my use case. (A sketch of such an override follows the interface excerpt below.)
/**
 * Experimental: if true, the engine will invoke pushdownFilter instead of getTableLayouts.
 *
 * This interface can be replaced with a connector optimizer rule once the engine supports these (#12546).
 */
@Experimental
default boolean isPushdownFilterSupported(ConnectorSession session, ConnectorTableHandle tableHandle)
{
    return false;
}

/**
 * Experimental: returns table layout that encapsulates the given filter.
 *
 * This interface can be replaced with a connector optimizer rule once the engine supports these (#12546).
 */
@Experimental
default ConnectorPushdownFilterResult pushdownFilter(ConnectorSession session, ConnectorTableHandle tableHandle, RowExpression filter, Optional<ConnectorTableLayoutHandle> currentLayoutHandle)
{
    throw new UnsupportedOperationException();
}
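For illustration, here is a minimal sketch of what an override could look like in a connector's ConnectorMetadata implementation (imports omitted; ExampleMetadata, ExampleTableLayoutHandle, isSupportedByRestApi and combineConjuncts are hypothetical names, and the LogicalRowExpressions helper and the ConnectorPushdownFilterResult constructor shape are assumptions about the prestodb SPI that may differ in your version):

public class ExampleMetadata implements ConnectorMetadata
{
    @Override
    public boolean isPushdownFilterSupported(ConnectorSession session, ConnectorTableHandle tableHandle)
    {
        // Opt in: the engine will now call pushdownFilter instead of getTableLayouts for this connector
        return true;
    }

    @Override
    public ConnectorPushdownFilterResult pushdownFilter(ConnectorSession session, ConnectorTableHandle tableHandle,
            RowExpression filter, Optional<ConnectorTableLayoutHandle> currentLayoutHandle)
    {
        List<RowExpression> handled = new ArrayList<>();
        List<RowExpression> remaining = new ArrayList<>();

        // Split the filter into conjuncts the remote REST API can evaluate and those it cannot
        for (RowExpression conjunct : LogicalRowExpressions.extractConjuncts(filter)) {
            if (isSupportedByRestApi(conjunct)) {       // hypothetical helper: can this predicate become a query parameter?
                handled.add(conjunct);
            }
            else {
                remaining.add(conjunct);
            }
        }

        // Store the handled conjuncts in the layout handle so splits/page sources can append them to the REST call,
        // and hand everything else back to the engine as the unenforced remainder
        ExampleTableLayoutHandle layoutHandle = new ExampleTableLayoutHandle(tableHandle, handled);
        return new ConnectorPushdownFilterResult(
                getTableLayout(session, layoutHandle),
                combineConjuncts(remaining));           // hypothetical helper that ANDs the remaining conjuncts (TRUE if empty)
    }
}

As noted above, the engine still expects the connector to accept any filter shape, which is why the unsupported conjuncts must be returned as the remainder rather than dropped.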

Related

Customize jOOQ dialect to alter the order in which LIMIT and OFFSET are rendered in a statement

I'm using jOOQ to generate queries to run against Athena (AKA PrestoDB/Trino).
To do this, I am using SQLDialect.DEFAULT, and it works because I use very basic query functionality.
However, jOOQ renders queries like this:
select *
from "Artist"
limit 10
offset 10
God knows why, but the order of limit and offset seems to matter, and the query only works if written with the order swapped:
select *
from "Artist"
offset 10
limit 10
Is there a class I can subclass, to modify the statement render function so that the order of these are swapped? Or any other way of implementing this myself?
A generic solution in jOOQ
There isn't a simple way to change something as fundamental as the SELECT clause order (or any other SELECT clause syntax) in jOOQ, simply because this was never a requirement for core jOOQ usage, other than supporting fringe SQL dialects. Since supporting a SQL dialect is a lot of work in jOOQ (with all the integration tests, edge cases, etc.), and since the market shares of those dialects are low, it has simply never been a priority to improve this in jOOQ.
You may be tempted to think that this is "just" about the order of keywords in this one case. "Only this one case." It never is. It never stops, and the subtle differences in dialects never end. Just look at the jOOQ code base to get an idea of how weirdly different vendors choose to make their dialects. In this particular case, one would think that it seems extremely obvious and simple to make this clause MySQL / PostgreSQL / SQLite compatible, so your best chance is to make a case with the vendor for a feature request. It should be in their own best interest to be more compatible with the market leaders, to facilitate migration.
Workarounds in jOOQ
You can, of course, patch your generated SQL on a low level, e.g. using an ExecuteListener and a simple regex. Whenever you encounter limit (\d+|\?) offset (\d+|\?), just swap the values (and bind values!). This might work reasonably well for top level selects. It's obviously harder if you're using LIMIT .. OFFSET in nested selects, but probably still doable.
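As a rough illustration of that regex approach, here is a sketch only: it assumes jOOQ's DefaultExecuteListener base class, handles just a top-level limit/offset pair, and does not attempt to reorder bind values for ? placeholders.

import org.jooq.ExecuteContext;
import org.jooq.impl.DefaultExecuteListener;

public class SwapLimitOffsetListener extends DefaultExecuteListener {

    @Override
    public void renderEnd(ExecuteContext ctx) {
        String sql = ctx.sql();
        if (sql != null) {
            // Rewrite "limit N offset M" into "offset M limit N" for literal values or single '?' markers
            ctx.sql(sql.replaceAll("(?i)limit (\\d+|\\?) offset (\\d+|\\?)", "offset $2 limit $1"));
        }
    }
}

The listener would then be registered on your Configuration like any other ExecuteListener.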
Patching jOOQ is always an option. The class you're looking for in jOOQ 3.17 is org.jooq.impl.Limit. It contains all the rendering logic for this clause. If that's your only patch, then it might be possible to upgrade jOOQ. But obviously, patching is a slippery slope, as you may start patching all sorts of clauses, making upgrades impossible.
You can obviously use plain SQL templates for simple cases, e.g. resultQuery("{0} offset {1} limit {2}", actualSelect, val(10), val(10)). This doesn't scale well, but if it's only about 1-2 queries, it might suffice.
Using SQLDialect.DEFAULT
I must warn you, at this point, that the behaviour of SQLDialect.DEFAULT is unspecified. Its main purpose is to produce something when you call QueryPart.toString() on a QueryPart that is not an Attachable, where a better SQLDialect is unavailable. The DEFAULT dialect may change between minor releases (or even patch releases, if there's an important bug in some toString() method), so any implementation you base on this is at risk of breaking with every upgrade.
The most viable long term solution
... would be to have support for these dialects in jOOQ:
#5414 Presto
#11485 Trino

JOOQ vs SQL Queries

I am working with jOOQ queries now... I feel that plain SQL queries look more readable and maintainable, so why do we need to use jOOQ instead of native SQL queries?
Can someone explain a few reasons for using it?
Thanks.
Here are the top value propositions that you will never get with native (string based) SQL:
Dynamic SQL is what jOOQ is really, really good at. You can compose the most complex queries dynamically based on user input, configuration, etc. and still be sure that the query will run correctly (see the short sketch after these points).
An often underestimated effect of dynamic SQL is that you will be able to think of SQL as an algebra, because instead of writing hard-to-compose native SQL syntax (with all the keywords, weird parenthesis rules, etc.), you can think in terms of expression trees, because you're effectively building an expression tree for your queries. Not only will this allow you to implement more sophisticated features, such as SQL transformation for multi-tenancy or row-level security, but also everyday tasks like transforming a set of values into a SQL set operation become much simpler.
Vendor agnosticity. As soon as you have to support more than one SQL dialect, writing SQL manually is close to impossible because of the many subtle differences between dialects. The jOOQ documentation illustrates this e.g. with the LIMIT clause. Once this is a problem you have, you have to use either JPA (a much more restricted query language: JPQL) or jOOQ (almost no limitations with respect to SQL usage).
Type safety. Now, you will get type safety when you write views and stored procedures as well, but very often, you want to run ad-hoc queries from Java, and there is no guarantee about table names, column names, column data types, or syntax correctness when you do SQL in a string based fashion, e.g. using JDBC or JdbcTemplate, etc. By the way: jOOQ encourages you to use as many views and stored procedures as you want. They fit perfectly in the jOOQ paradigm.
Code generation. Which leads to more type safety. Your database schema becomes part of your client code. Your client code no longer compiles when your queries are incorrect. Imagine someone renaming a column and forgetting to refactor the 20 queries that use it. IDEs only provide some degree of safety when writing the query for the first time, they don't help you when you refactor your schema. With jOOQ, your build fails and you can fix the problem long before you go into production.
Documentation. The generated code also acts as documentation for your schema. Comments on your tables and columns turn into Javadoc, which you can introspect in your client language without having to look them up on the server.
Data type bindings are very easy with jOOQ. Imagine using a library of 100s of stored procedures. Not only will you be able to access them type safely (through code generation), as if they were actual Java code, but you don't have to worry about the tedious and useless activity of binding each single in and out parameter to a type and value.
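To make the dynamic SQL and code generation points concrete, here is a minimal sketch (the ARTIST table and its NAME/ID columns stand in for jOOQ-generated classes from your own schema; the method and parameter names are made up for the example):

import static org.jooq.impl.DSL.noCondition;
// import static com.example.generated.Tables.ARTIST;   // hypothetical generated code

import org.jooq.Condition;
import org.jooq.DSLContext;
import org.jooq.Result;

public class DynamicQueryExample {

    // nameFilter and minId are optional user inputs; null means "no filter on that column"
    static Result<?> findArtists(DSLContext ctx, String nameFilter, Integer minId) {
        Condition condition = noCondition();

        if (nameFilter != null)
            condition = condition.and(ARTIST.NAME.containsIgnoreCase(nameFilter));

        if (minId != null)
            condition = condition.and(ARTIST.ID.ge(minId));

        // The query is assembled as an expression tree and rendered for the configured dialect
        return ctx.selectFrom(ARTIST).where(condition).fetch();
    }
}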
There are a ton of more advanced features derived from the above, such as:
The availability of a parser and by consequence the possibility of translating SQL.
Schema management tools, such as diffing two schema versions
Basic ActiveRecord support, including some nice things like optimistic locking.
Synthetic SQL features like type safe implicit JOIN
Query By Example.
A nice integration in Java streams or reactive streams.
Some more advanced SQL transformations (this is work in progress).
Export and import functionality
Simple JDBC mocking functionality, including a file based database mock.
Diagnostics
And, if you occasionally think something is much simpler to do with plain native SQL, then just:
Use plain native SQL, also in jOOQ
Disclaimer: As I work for the vendor, I'm obviously biased.

Google Datastore filter with OR condition

I am working with NodeJS on Google App Engine with the Datastore database.
I am using a composite query filter and just need a basic "OR" condition.
Example: Query Tasks that have Done = false OR priority = 4
const query = datastore.createQuery('Task')
.filter('done', '=', false) //How to make this an OR condition?
.filter('priority', '=', 4);
However, according to the documentation:
Cloud Datastore currently only natively supports combining filters
with the AND operator.
What is a good way to achieve a basic OR condition without running two entirely separate queries and then combining the results?
UPDATE
I have my solution described in detail here in my other post. Any feedback for improvements to the solution would be appreciated since I'm still learning NodeJS.
Not currently possible to achieve a query with an OR condition - this is what the note you quoted means.
Some client libraries provide some (limited) support for OR-like operations. From Restrictions on queries:
The nature of the index query mechanism imposes certain restrictions
on what a query can do. Cloud Datastore queries do not support
substring matches, case-insensitive matches, or so-called full-text
search. The NOT, OR, and != operators are not natively
supported, but some client libraries may add support on top of Cloud
Datastore.
But AFAIK no such library is available for NodeJS.
If you only need a few specific queries like this, one possible approach would be to compute (at the time of writing the entities) an additional property holding the desired result for such a query, and to use equality filters on that property instead.
For example, assuming you'd like a query OR-ing the equivalents of these filters:
.filter('status', '=', 'queued')
.filter('status', '=', 'running')
You could compute a property like not_done every time status changes and set it to true if status is either queued or running and false otherwise. Then you can use .filter('not_done', '=', true) which would have the same semantics. Granted, it's not convenient, but it may get you past the hurdle.
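Purely as an illustration of that pattern (the question uses the NodeJS client; this sketch uses the Google Cloud Datastore Java client library instead, and the key name is made up):

import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Query;
import com.google.cloud.datastore.QueryResults;
import com.google.cloud.datastore.StructuredQuery.PropertyFilter;

public class NotDoneWorkaround {

    public static void main(String[] args) {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

        // At write time: derive not_done from status, so a single equality filter can replace the OR
        String status = "queued";   // hypothetical current status of the task
        Entity task = Entity.newBuilder(datastore.newKeyFactory().setKind("Task").newKey("task-1"))
                .set("status", status)
                .set("not_done", status.equals("queued") || status.equals("running"))
                .build();
        datastore.put(task);

        // At read time: one equality filter stands in for status = 'queued' OR status = 'running'
        Query<Entity> query = Query.newEntityQueryBuilder()
                .setKind("Task")
                .setFilter(PropertyFilter.eq("not_done", true))
                .build();
        QueryResults<Entity> results = datastore.run(query);
        results.forEachRemaining(t -> System.out.println(t.getKey()));
    }
}

The same pattern applies unchanged with the NodeJS client: set not_done when saving the entity and query it with .filter('not_done', '=', true).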
I wrote an answer on your other question, regarding using Array properties on Cloud Datastore to work around some cases where having the OR operator would have helped: https://stackoverflow.com/a/74958631/963901

Google Datastore query filter for multiple values for same property

I have a query I wish to run on Google Datastore that is intended to retrieve data from multiple devices. However, I couldn't find anything in the documentation that would allow me to get data from e.g. device-1 or device-2 or device-3, i.e. only one value per property filter can be set. Is this a Datastore limitation? Or am I just missing something I don't know about?
Based on the NodeJS client library, the query might look something like the below filter criteria:
var query = datastore.createQuery('data')
.filter('device_id', 1)
.filter('device_id', 2)
.filter('device_id', 3);
Otherwise, I might have to run separate queries for the various devices, which doesn't seem like a very elegant solution, especially if there are a lot of devices to simultaneously run queries on.
Any suggestions for the Datastore API or alternative approaches are welcome!
Yes, this would be an OR operation which is one of the Restrictions on queries (emphasis mine):
The nature of the index query mechanism imposes certain restrictions
on what a query can do. Cloud Datastore queries do not support
substring matches, case-insensitive matches, or so-called full-text
search. The NOT, OR, and != operators are not natively
supported, but some client libraries may add support on top of Cloud
Datastore.

how to extend apache spark api?

I've been tasked with figuring out how to extend Spark's API to include some custom hooks for another program, like IPython Notebook, to latch on to. I've already gone through the quick start guide, the cluster mode overview, the submitting applications doc, and this Stack Overflow question. Everything I'm seeing indicates that, to get something to run in Spark, you need to use
spark-submit
to make it happen. As such, I whipped up some code that, vis-à-vis Spark, pulled the first ten rows of test data out of an Accumulo table I created. My team lead, however, is telling me to modify Spark itself. Is this the preferred way to accomplish the task I described? If so, why? What's the value proposition?
No details have been provided about what types of operations your application requires, so an answer here will need to remain general in nature.
The question of extending Spark itself may come down to:
Can I achieve the needs of the application by leveraging the existing
methods within Spark(/SQL/Hive/Streaming)Context and RDD
(/SchemaRDD/DStream/..)?
An additional consideration:
Is it possible to embed the required functionality inside the
transformation methods of RDDs - either with custom code or by invoking
third-party libraries?
The likely distinguishing factor here is whether the existing data access and shuffle/distribution structures support your needs. When it comes to data transformations - in most cases you should be able to embed the required logic within the methods of RDD.
So:
case class InputRecord(/* input fields */)
case class OutputRecord(/* output fields */)

// InputRecord.fromInputLine(line) is assumed to be defined on the companion object
// and to parse one line of the text file into an InputRecord
def myTransformationLogic(inputRec: InputRecord): OutputRecord = {
  // put your biz rules/transforms here and return the resulting record
  OutputRecord(/* fields derived from inputRec */)
}

val myData = sc.textFile("<hdfs input path>").map(l => InputRecord.fromInputLine(l))
val outputData = myData.map(myTransformationLogic)
outputData.saveAsTextFile("<hdfs output path>")
