I am trying to run a Hive query with PySpark. I am using Hortonworks, so I need to use the Hive Warehouse Connector.
Running one or even multiple queries is easy and works. My problem is that I want to issue set commands beforehand, for instance to set the DAG name in the Tez UI (set hive.query.name=something relevant) or to set up some memory configuration (set hive.tez.container.size=8192). For these statements to take effect, they need to run in the same session as the main query, and that's my issue.
I tried 2 ways:
The first one was to generate a new Hive session for each query, with a properly set up URL, e.g.:
url='jdbc:hive2://hiveserver:10000/default?hive.query.name=relevant'
builder = HiveWarehouseSession.session(self.spark)
builder.hs2url(url)
hive = builder.build()
hive.execute("select * from whatever")
It works well for the first query, but the same URL is reused for the next one (even if I try to manually delete builder and hive), so this does not work.
The second way is to set spark.sql.hive.thriftServer.singleSession=true globally on the Spark Thrift Server. This does seem to work, but I find it a shame to constrain the global Spark Thrift Server for the benefit of a single application.
Is there a way to achieve what I am looking for? Maybe there is a way to pin a query to one executor, and therefore, hopefully, to one session?
This has been a big pet peeve of mine... and it still is, actually.
The solution that resolved this for me was to put all the queries in a query file, with each query separated by a semicolon, and then run that file with beeline from within a Python script, as sketched below.
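For illustration, a minimal sketch of that approach (the JDBC URL, user name, and file name are just placeholders):

import subprocess

# queries.hql holds the semicolon-separated statements, e.g.:
#   set hive.query.name=relevant;
#   set hive.tez.container.size=8192;
#   insert overwrite table target select * from whatever;
result = subprocess.run(
    ["beeline",
     "-u", "jdbc:hive2://hiveserver:10000/default",  # placeholder URL
     "-n", "some_user",                              # placeholder user
     "-f", "queries.hql"],
    capture_output=True, text=True)
if result.returncode != 0:
    raise RuntimeError("beeline failed: " + result.stderr)

Since all the statements go through a single beeline invocation, they run in the same Hive session, so the set commands apply to the queries that follow them.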
Unfortunately, it does not work for queries that return results; it is only suitable for set, overwrite, and insert kinds of queries.
In case you might have discovered a more efficient way to do this, please do share.
Related
I have a question/opinion that needs an expert's suggestion.
I have a table called config that, as the name suggests, contains some configuration information. I need these details to be accessible from all the executors during my job's life cycle. My first option was to broadcast them as a List[Case Class], but then I got the idea of registering the config as a temp table using registerTempTable() and using it across my job.
Can this temp table approach be used as an alternative to broadcast variables? (I have extensive hands-on experience with broadcasting.)
registerTempTable just gives you the possibility to run plain SQL queries against your DataFrame; there is no performance benefit, caching, or materialization involved.
You should go with broadcasting (I would suggest using a Map for the configuration parameters).
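For example, a rough PySpark sketch of the broadcast approach (the config table name and its key/value columns are assumptions; the same idea works in Scala with a Map):

# Collect the small config table on the driver and broadcast it as a plain dict.
config_rows = spark.table("config").collect()
config_map = {row["key"]: row["value"] for row in config_rows}
config_b = spark.sparkContext.broadcast(config_map)

# Any task on any executor can then read the values cheaply.
def tag_with_threshold(record):
    return (record, config_b.value.get("threshold"))

tagged = some_rdd.map(tag_with_threshold)  # some_rdd stands for whatever RDD the job processes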
If you use registerTempTable() and then join against it for lookups, Spark will mostly end up using a broadcast join internally anyway, given that the table/config file size is < 10 MB.
I have a thrift server up and running, with users sending queries over a JDBC connection. Can I intercept and modify the queries as they come in, and then send the result of the modified query back to the user?
For example - I want the user to be able to send the query
SELECT * FROM table_x WHERE pid="123";
And have the query modified to
SELECT * FROM table_y WHERE pid="123";
and the results of the second query should be returned. This should be transparent to the user.
SparkExecuteStatementOperation and SparkSession are where we thought we would add our code. I am using (not yet in production) a simple rule based on some external policy: I change the table name to a view name in the SQL before passing it along. It's a bit hacky though.
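For illustration only, the string-level rename could look something like this (the mapping and names are hypothetical, and this sits outside the Thrift Server internals):

import re

# Hypothetical policy: table names users send vs. the views they should actually hit.
TABLE_TO_VIEW = {"table_x": "table_y"}

def rewrite(sql):
    # Naive word-boundary replacement; fine for simple queries, but a real solution
    # needs a SQL parser to cope with aliases, quoting, subqueries, CTEs, etc.
    for table, view in TABLE_TO_VIEW.items():
        sql = re.sub(r"\b" + re.escape(table) + r"\b", view, sql)
    return sql

print(rewrite('SELECT * FROM table_x WHERE pid="123"'))
# SELECT * FROM table_y WHERE pid="123"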
There is no built-in way to change the query inside Spark Thrift Server. You can instead change the query before it reaches your JDBC/ODBC driver. For a simple query, string modification is enough: a table name change is easy. Parsing and modifying a complex query, however, takes several operations and is not easy.
You could use a database proxy to rewrite the queries as needed before they hit the database(s).
I'm not sure if it makes sense in your particular situation, but if it does, take a look at Gallium Data; that's a common use case for it.
I am running a Spark job in which some data is loaded from a Cassandra table. From that data, I build some insert and delete statements and execute them (using forEach):
boolean deleteStatus= connector.openSession().execute(delete).wasApplied();
boolean insertStatus = connector.openSession().execute(insert).wasApplied();
System.out.println(delete+":"+deleteStatus);
System.out.println(insert+":"+insertStatus);
When I run it locally, I see the respective results in the table.
However, when I run it on a cluster, sometimes the changes take effect and sometimes they don't.
Looking at the stdout in the Spark web UI, both queries were printed along with true. (The data was loaded correctly, but sometimes only the insert is reflected, sometimes only the delete, sometimes both, and most of the time neither.)
Specifications:
Spark slaves on the same machines as the Cassandra nodes (each node runs two slave instances).
Spark master on a separate machine.
Repair done on all nodes.
Cassandra restarted.
boolean deleteStatus= connector.openSession().execute(delete).wasApplied();
boolean insertStatus = connector.openSession().execute(insert).wasApplied();
This is a known anti-pattern: you create a new Session object for each query, which is extremely expensive.
Just create the session once and re-use it for all the queries.
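The question uses the Java connector, but the create-once-and-reuse principle, sketched here with the DataStax Python driver (contact point and keyspace are placeholders), looks like this:

from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])           # placeholder contact point
session = cluster.connect("my_keyspace")  # open ONE session for the whole job

for delete, insert in statements:         # statements built earlier in the job
    session.execute(delete)
    session.execute(insert)

cluster.shutdown()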
To see which queries are being executed and sent to Cassandra, use the slow query logger feature as a hack: http://datastax.github.io/java-driver/manual/logging/#logging-query-latencies
The idea is to set the threshold to a ridiculously low value so that every query will be considered slow and displayed in the log.
You should use this hack only for testing, of course.
We are using cached PreparedStatements for queries to DataStax Cassandra, but if we need to add new columns to a table, we have to restart our application server to re-cache the prepared statements.
I came across this issue in the Cassandra Java driver tracker, which explains a solution:
https://datastax-oss.atlassian.net/browse/JAVA-420
It basically gives a workaround: do not use "SELECT * FROM table" in the query, but use "SELECT column_names FROM table" instead.
But now we have come across the same issue with DELETE statements: after adding a new column to a table, the DELETE prepared statement does not delete a record.
I don't think we can use the same workaround mentioned in the ticket for the SELECT statement, as * or column names do not make sense when deleting a row.
Any help would be appreciated. We basically want to avoid having to restart our application server for any additions to database tables
We basically want to avoid having to restart our application server for any additions to database tables
An easy solution that requires a little bit of coding: use JMX.
Let me explain.
In your application code, keep a cache (you can use the Guava cache implementation, for example) of all prepared statements. The cache key can be, for example, the query string.
Now, expose a JMX method to clear the cache and force the application to re-prepare the queries.
Every time you update the schema, just call the appropriate method(s) to clear the cache; you don't need to restart your application.
I'm evaluating the spark-cassandra-connector and I'm struggling to get a range query on the partition key to work.
According to the connector's documentation, it seems that server-side filtering on the partition key is possible using the equality or IN operator, but unfortunately my partition key is a timestamp, so I cannot use them.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also, I can assure you that there is data to be returned, since running the query in cqlsh (doing the appropriate conversion using the 'token' function) DOES return data.
I'm using Spark 1.1.0 in standalone mode. Cassandra is 2.1.2 and the connector version is the 'b1.1' branch. The Cassandra driver is the DataStax 'master' branch.
The Cassandra cluster is overlaid on the Spark cluster, with 3 servers and a replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra, and that you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over it.
So your code (in Scala) should look something like this:
import com.datastax.spark.connector._

val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
val start = fmt.parse("2013-01-01T00:00:00.000Z").getTime
val end = fmt.parse("2013-12-31T00:00:00.000Z").getTime
val cassRDD = sc.cassandraTable("keyspace name", "table name")
  .filter(row => { val ts = row.getDate("timestamp").getTime; ts >= start && ts < end })
If you are interested in making this type of query, you might have to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take full advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small-to-medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the direct Cassandra driver, which is not integrated with Spark, or you can have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
But, as I said before, I don't know whether it fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option you have, though the least efficient, is to bring the full data set into your Spark cluster and apply a filter on the RDD.
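For completeness, a rough PySpark sketch of that last option, assuming a connector version recent enough to expose the Cassandra DataFrame source (keyspace and table names taken from the question):

from pyspark.sql import functions as F

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="datastore", table="data")
      .load())

# The whole table is read into Spark; the range predicate is applied on the Spark side.
filtered = df.filter((F.col("timestamp") >= "2013-01-01") &
                     (F.col("timestamp") < "2013-12-31"))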
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!