Intercept and modify incoming SQL queries to Spark Thrift Server - apache-spark

I have a thrift server up and running, with users sending queries over a JDBC connection. Can I intercept and modify the queries as they come in, and then send the result of the modified query back to the user?
For example - I want the user to be able to send the query
SELECT * FROM table_x WHERE pid="123";
And have the query modified to
SELECT * FROM table_y WHERE pid="123";
and the results of the second query should be returned. This should be transparent to the user.

SparkExecuteStatementOperation and SparkSession are where we thought we would add our code. I am using (not yet in production) a simple rule based on an external policy: I change the table name to a view name in the SQL before passing it on. It's a bit hacky, though.

There is no way to change the query inside Spark Thrift Server itself. You can instead rewrite the query before it reaches your JDBC/ODBC driver. For a simple query, plain string modification works: changing only a table name is easy. Parsing and rewriting a complex query, however, takes several operations and is not easy.
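If you do go the pre-driver route, here is a minimal sketch of the string-modification idea, assuming a hypothetical TABLE_REWRITES policy mapping and a rewrite_table helper (only reasonable for simple queries where the table name appears unambiguously):

import re

# Hypothetical policy: which table names should be swapped before execution.
TABLE_REWRITES = {"table_x": "table_y"}

def rewrite_table(sql: str) -> str:
    """Replace whole-word table names according to TABLE_REWRITES."""
    for original, replacement in TABLE_REWRITES.items():
        sql = re.sub(rf"\b{re.escape(original)}\b", replacement, sql)
    return sql

print(rewrite_table('SELECT * FROM table_x WHERE pid="123";'))
# -> SELECT * FROM table_y WHERE pid="123";

The rewritten string is what you would then send over JDBC/ODBC; anything more involved than a table rename really calls for a SQL parser or a proxy.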

You could use a database proxy to rewrite the queries as needed before they hit the database(s).
I'm not sure if it makes sense in your particular situation, but if it does, take a look at Gallium Data, that's a common use case.

Related

Use RichMap in Flink like Scala MapPartition

In Spark, we have MapPartition function, which is used to do some initialization for a group of entries, like some db operation.
Now I want to do the same thing in Flink. After some research I found out that I can use RichMap for the same purpose, but it has a drawback: the operation can only be done in the open method, which runs at the start of the streaming job. I will explain my use case, which should clarify the situation.
Example: I am getting data for millions of users from Kafka, but I only want the data of some users to be persisted. This list of users is dynamic and lives in a DB. I want to look up the current users every 10 minutes, so that I filter out and store the data for only those users. In Spark (MapPartition), the user lookup would happen for every group, and I had configured it to refresh the users from the DB every 10 minutes. But with Flink, using RichMap I can do that only in the open function, when my job starts.
How can I do the same thing in Flink?
It seems that what you want to do is a stream-table join. There are multiple ways of doing that, but it seems the easiest one would be to use the broadcast state pattern here.
The idea is to define a custom DataSource that periodically queries the data from the SQL table (or, even better, use CDC), use that table stream as broadcast state, and connect it with the actual users stream.
Inside the ProcessFunction for the connected streams you will have access to the broadcasted table data, and you can perform a lookup for every user you receive and decide what to do with it.
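A rough sketch of the broadcast state pattern, here using PyFlink's DataStream API (available in recent Flink versions, 1.16+); the stream names, the event layout, and the placeholder sources are assumptions, and the exact class and method names should be checked against your Flink version:

from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import BroadcastProcessFunction
from pyflink.datastream.state import MapStateDescriptor

env = StreamExecutionEnvironment.get_execution_environment()

# Placeholder sources: in reality user_events comes from Kafka and
# allowed_users_stream from a custom periodic DB source (or CDC).
user_events = env.from_collection([("u1", "click"), ("u2", "view")],
                                  type_info=Types.TUPLE([Types.STRING(), Types.STRING()]))
allowed_users_stream = env.from_collection(["u1"], type_info=Types.STRING())

# Broadcast state holding the current set of users to keep.
users_desc = MapStateDescriptor("allowed-users", Types.STRING(), Types.BOOLEAN())

class FilterByAllowedUsers(BroadcastProcessFunction):
    def process_element(self, event, ctx):
        # Called per event: look the user up in the broadcasted table data.
        if ctx.get_broadcast_state(users_desc).get(event[0]) is not None:
            yield event

    def process_broadcast_element(self, user_id, ctx):
        # Called per record from the DB/CDC source: refresh the broadcast state.
        ctx.get_broadcast_state(users_desc).put(user_id, True)

filtered = user_events.connect(allowed_users_stream.broadcast(users_desc)) \
                      .process(FilterByAllowedUsers())
filtered.print()
env.execute("filter-by-allowed-users")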

Run multiple hive query from pyspark on the same session

I am trying to run a Hive query with pyspark. I am using Hortonworks, so I need to use the Hive Warehouse Connector.
Running one or even multiple queries is easy and works. My problem is that I want to issue set commands beforehand, for instance to set the DAG name in the Tez UI (set hive.query.name=something relevant) or to set some memory configuration (set hive.tez.container.size=8192). For these statements to take effect, they need to run in the same session as the main query, and that's my issue.
I tried 2 ways:
The first one was to generate a new Hive session for each query, with a properly set up URL, e.g.:
from pyspark_llap import HiveWarehouseSession

# Build a fresh HWC session whose connection URL carries the desired setting.
url = 'jdbc:hive2://hiveserver:10000/default?hive.query.name=relevant'
builder = HiveWarehouseSession.session(self.spark)
builder.hs2url(url)
hive = builder.build()
hive.execute("select * from whatever")
It works well for the first query, but the same URL is reused for the next one (even if I try to manually delete builder and hive), so it does not work.
The second way is to set spark.sql.hive.thriftServer.singleSession=true globally in the spark thrift server. This does seem to work, but I find it a shame to limit the global spark thrift server for the benefit of one application only.
Is there a way to achieve what I am looking for? Maybe there could be a way to pin a query to one executor, so hopefully one session?
This has been a big peeve of mine... still is, actually.
The solution that resolved this issue for me was putting all the queries in a query file, with each query separated by a semicolon, and then running that file with beeline from within a Python script.
Unfortunately, it does not work with queries that return results; it is only suitable for set, overwrite, insert and similar queries.
In case you might have discovered a more efficient way to do this, please do share.
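For reference, a minimal sketch of that approach (query file plus beeline driven from Python); the JDBC URL, file name, and queries are placeholders:

import subprocess

# queries.hql holds the set commands and the main statement, separated by
# semicolons, so they all run in one Hive session.
queries = """
set hive.query.name=something relevant;
set hive.tez.container.size=8192;
insert overwrite table target select * from whatever;
"""
with open("queries.hql", "w") as f:
    f.write(queries)

# Placeholder connection string; add -n/-p or Kerberos options as needed.
subprocess.run(
    ["beeline", "-u", "jdbc:hive2://hiveserver:10000/default", "-f", "queries.hql"],
    check=True,
)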

What happens when I "select * from ColumnFamily" in a given Cassandra Cluster

Can someone explain (and point to documentation explaining) the behavior of
select * from <keyspace.table>
Let's assume I have a 5-node cluster; how does the DataStax Cassandra driver behave when such queries are issued?
(Fetchsize was set to 500)
Is this a proper way to pull data ? Does it cause any performance issues?
No, that's really a very bad way to pull data. Cassandra shines when it fetches data by at least the partition key (which identifies the server that holds the actual data). When you do select * from table, the request is sent to a coordinator node, which has to pull all the data from all servers and funnel it back through itself; this overloads the coordinator and will most probably lead to timeouts if you have enough data in the cluster.
If you really need to perform a full fetch of the data from the cluster, it's better to use something like the Spark Cassandra Connector, which reads data by token ranges, fetching the data directly from the nodes that hold it, and doing so in parallel. You can of course implement the token range scan in the Java driver, something like this, but it will require more work on your side compared to using Spark.
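If you do go through Spark, here is a minimal pyspark sketch of a full-table read via the Spark Cassandra Connector; the host, keyspace, table, and output path are placeholders, and the connector package (e.g. com.datastax.spark:spark-cassandra-connector) must be on the Spark classpath:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("full-table-scan")
         # Must point at your Cassandra cluster.
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

# The connector splits the read by token range and fetches each range
# directly from the replicas that own it, in parallel.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")
      .load())

df.write.parquet("/tmp/my_table_dump")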

Cassandra Prepared Statement and adding new columns

We are using cached PreparedStatement for queries to DataStax Cassandra. But if we need to add new columns to a table, we need to restart our application server to recache the prepared statement.
I came across this bug in Cassandra that explains the solution:
https://datastax-oss.atlassian.net/browse/JAVA-420
It basically gives a workaround: do not use "SELECT * FROM table" in the query, but use "SELECT column_names FROM table" instead.
But now we came across the same issue with Delete statements. After adding a new column to a table, the Delete prepared statement does not delete a record.
I don't think we can use the same workaround mentioned in the ticket for the Select statement, as * or column_names does not make sense when deleting a row.
Any help would be appreciated. We basically want to avoid having to restart our application server for any additions to database tables.
We basically want to avoid having to restart our application server for any additions to database tables
An easy solution that requires a little bit of coding: use JMX.
Let me explain.
In your application code, keep a cache (you can use the Guava cache implementation, for example) of all prepared statements. The key to access the cache can be, for example, the query string.
Now, expose a JMX method to clear the cache and force the application to re-prepare the queries.
Every time you update the schema, just call the appropriate method(s) to clear the cache; you don't need to restart your application.
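The answer above describes a Java-side cache exposed through JMX. Purely as an illustration of the cache-and-invalidate idea, here is a sketch using the Python DataStax driver; the host, keyspace, and query are placeholders, and in the Java application the clear() hook would be the JMX-exposed method:

from cassandra.cluster import Cluster

class PreparedStatementCache:
    """Cache of prepared statements keyed by query string, with an explicit
    invalidation hook (the JMX-exposed operation in the Java case)."""

    def __init__(self, session):
        self._session = session
        self._cache = {}

    def get(self, query):
        stmt = self._cache.get(query)
        if stmt is None:
            stmt = self._session.prepare(query)
            self._cache[query] = stmt
        return stmt

    def clear(self):
        # Call this after a schema change (e.g. ALTER TABLE ... ADD ...)
        # so every query is re-prepared against the new schema.
        self._cache.clear()

session = Cluster(["cassandra-host"]).connect("my_keyspace")
cache = PreparedStatementCache(session)
row = session.execute(cache.get("SELECT id, name FROM users WHERE id = ?"), [42]).one()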

Apache Calcite Data Federation Usecase

Just want to check whether Apache Calcite can be used for the "data federation" use case (querying across multiple databases).
The idea is that I have a master query (5 tables) with tables from one database (say Hive) and 3 tables from another database (say MySQL).
Can I execute the master query across multiple databases from one JDBC client interface?
If this is possible, where does the query execution (particularly the inter-database join) happen?
Also, can I get a physical plan from Calcite that I can then execute explicitly in another execution engine?
I read in the Calcite documentation that it can push down Join and GroupBy, but I could not understand it. Can anyone help me understand this?
I will try to answer. You can also send questions to the mailing list, dev#calcite.apache.org; you are more likely to get an answer there.
Can I execute the master query across multiple databases from one JDBC client interface? If this is possible, where does the query execution (particularly the inter-database join) happen?
Yes, you can. The inter-database join happens in memory, where Calcite runs.
Can I get a physical plan from Calcite that I can execute explicitly in another execution engine?
Yes, you can. A lot of Calcite consumers do it this way, but you will have to wrap around the Calcite rule system.
I read in the Calcite documentation that it can push down Join and GroupBy, but I could not understand it. Can anyone help me understand this?
These are SQL optimizations that the engine performs. Imagine a GroupBy that could have happened on a tiny table but is actually specified after joining with a huge table: pushing the GroupBy down so it runs on the small input before the join means far fewer rows flow through the join.
