Query Spark SQL from Node.js server - node.js

I'm currently using npm's cassandra-driver to query my Cassandra database from a Node.js server. Since I want to be able to write more complex queries, I'd like to use Spark SQL instead of CQL. Is there any way to create a RESTful API (or something else) so that I can use Spark SQL the same way that I currently use CQL?
In other words, I want to be able to send a Spark SQL query from my Node.js server to another server and get a result back.
Is there any way to do this? I've been searching for solutions to this problem for a while and haven't found anything yet.
Edit: I'm able to query my database with Scala and Spark SQL from the Spark shell, so that bit is working. I just need to connect Spark and my Node.js server somehow.

I had a similar problem, and I solved by using Spark-JobServer.
The main approach with Spark-Jobserver (SJS) usually is to create a special job that extends their SparkSQLJob such as in the following example:
object ExecuteQuery extends SparkSQLJob {
override def validate(sqlContext: SQLContext, config: Config): SparkJobValidation = {
// Code to validate the parameters received in the request body
}
override def runJob(sqlContext: SQLContext, jobConfig: Config): Any = {
// Assuming your request sent a { "query": "..." } in the body:
val df = sqlContext.sql(config.getString("query"))
createResponseFromDataFrame(df) // You should implement this
}
}
However, for this approach to work well with Cassandra, you have to use the spark-cassandra-connector and then, to load the data you will have two options:
1) Before calling this ExecuteQuery via REST, you have to transfer the full data you want to query from Cassandra to Spark. For that, you would do something like (code adapted from the spark-cassandra-connector documentation):
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "words", "keyspace" -> "test"))
.load()
And then register it as a table in order to SparkSQL be able to access it:
df.registerAsTempTable("myTable") // As a temporary table
df.write.saveAsTable("myTable") // As a persistent Hive Table
Only after that you would be able to use the ExecuteQuery to query from myTable.
2) As the first option can be inefficient in some use cases, there is another option.
The spark-cassandra-connector has a special CassandraSQLContext that can be used to query C* tables directly from Spark. It can be used like:
val cc = new CassandraSQLContext(sc)
val df = cc.sql("SELECT * FROM keyspace.table ...")
However, to use a different type of context with Spark-JobServer, you need to extend SparkContextFactory and use it in the moment of context creation (which can be done by a POST request to /contexts). An example of a special context factory can be seen on SJS Gitub. You also have to create a SparkCassandraJob, extending SparkJob (but this part is very easy).
Finally, the ExecuteQuery job have to be adapted to use the new classes. It would be something like:
object ExecuteQuery extends SparkCassandraJob {
override def validate(cc: CassandraSQLContext, config: Config): SparkJobValidation = {
// Code to validate the parameters received in the request body
}
override def runJob(cc: CassandraSQLContext, jobConfig: Config): Any = {
// Assuming your request sent a { "query": "..." } in the body:
val df = cc.sql(config.getString("query"))
createResponseFromDataFrame(df) // You should implement this
}
}
After that, the ExecuteQueryjob can be executed via REST with a POST request.
Conclusion
Here I use the first option because I need the advanced queries available in the HiveContext (window functions, for example), which are not available in the CassandraSQLContext. However, if you don't need those kind of operations, I recommend the second approach, even if it needs some extra coding to create a new ContextFactory for SJS.

Related

Spark + EMRFS/S3 - Is there a way to read client side encrypted data and write it back using server side encryption?

I have a use-case in spark where I have to read data from a S3 that uses client-side encryption, process it and write it back using only server-side encryption. I'm wondering if there's a way to do this in spark?
Currently, I have these options set:
spark.hadoop.fs.s3.cse.enabled=true
spark.hadoop.fs.s3.enableServerSideEncryption=true
spark.hadoop.fs.s3.serverSideEncryption.kms.keyId=<kms id here>
But obviously, it's ending up using both CSE and SSE while writing the data. So, I'm wondering it it's possible to somehow only set spark.hadoop.fs.s3.cse.enabled to true while reading and then set it to false or maybe another alternative.
Thanks for the help.
Using programmatic configuration to define multiple S3 filesystems:
spark.hadoop.fs.s3.cse.enabled=true
spark.hadoop.fs.s3sse.impl=foo.bar.S3SseFilesystem
and then add a custom implementation for s3sse:
package foo.bar
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.s3a.S3AFileSystem
class S3SseFilesystem extends S3AFileSystem {
override def initialize(name: URI, originalConf: Configuration): Unit = {
val conf = new Configuration()
// NOTE: no prefix spark.hadoop here
conf.set("fs.s3.enableServerSideEncryption", "true")
conf.set("fs.s3.serverSideEncryption.kms.keyId", "<kms id here>")
super.initialize(name, conf)
}
}
After this, the custom file system can be used with Spark read method
spark.read.json("s3sse://bucket/prefix")

Datastax Node.js Cassandra driver When to use a Mapper vs. Query

I'm working with the Datastax Node.js driver and I can't figure out when to use a mapper vs. query. Both seem to be able to perform the same CRUD operations.
With a query:
const q = SELECT * FROM mykeyspace.mytable WHERE id='12345';
client.execute(q).then(result => console.log('This is the data', result);
With mapper:
const tableRow = await tableMapper.find({ id: '12345' });
When should I use the mapper over a query and vice versa?
Mapper is a feature from cassandra-driver released in 2018. Using mapper, cassandra-driver can make a map from your cassandra table to an object in nodejs and you can handle in your nodejs application like a set of document.
Using mapper you can make selects or inserts in your database like said in this article:
https://www.datastax.com/blog/2018/12/introducing-datastax-nodejs-mapper-apache-cassandra
With query method, if you need to use or reuse any property from your json you will need to make a Json.Parse().
The short answer is: whatever you find more comfortable.
The Mapper lets you deal with database data as documents (JavaScript objects), builds the CQL query for you, executes the query and maps the results.
On the other hand, the core driver only supports executing CQL queries that you have to write yourself.

Is there any way to find out which node has been used by SELECT statement in Cassandra?

I have written a custom LoadBalancerPolicy for spark-cassandra-connector and now I want to ensure that it really works!
I have a Cassandra cluster with 3 nodes and a keyspace with a replication factor of 2, so when we want to retrieve a record, there will be only two nodes on cassandra which hold the data.
The thing is that I want to ensure the spark-cassandra-connector (with my load-balancer-policy) is still token-aware and will choose the right node as coordinator for each "SELECT" statement.
Now, I'm thinking if we can write a trigger on the SELECT statement for each node, in case of the node does not hold the data, the trigger will create a log and I realize the load-balancer-policy does not work properly. How can we write a trigger On SELECT in Cassandra? Is there any better way to accomplish that?
I already checked the documentation for creating the triggers and those are too limited:
Official documentation
Documentation at DataStax
Example implementation in official repo
You can do it from the program side, if you get routing key for your bound statement (you must use prepared statements), find the replicas for it via Metadata class, and then compare if this host is in the ExecutionInfo that you can get from ResultSet.
According to what Alex said, we can do it as below:
After creating SparkSession, we should make a connector:
import com.datastax.spark.connector.cql.CassandraConnector
val connector = CassandraConnector.apply(sparkSession.sparkContext.getConf)
Now we can define a preparedStatement and do the rest:
connector.withSessionDo(session => {
val selectQuery = "select * from test where id=?"
val prepareStatement = session.prepare(selectQuery)
val protocolVersion = session.getCluster.getConfiguration.getProtocolOptions.getProtocolVersion
// We have to explicitly bind the all of parameters that partition key is based on them, otherwise the routingKey will be null.
val boundStatement = prepareStatement.bind(s"$id")
val routingKey = boundStatement.getRoutingKey(protocolVersion, null)
// We can get tha all of nodes that contains the row
val replicas = session.getCluster.getMetadata.getReplicas("test", routingKey)
val resultSet = session.execute(boundStatement)
// We can get the node which gave us the row
val host = resultSet.getExecutionInfo.getQueriedHost
// Final step is to check whether the replicas contains the host or not!!!
if (replicas.contains(host)) println("It works!")
})
The important thing is that we have to explicitly bind the all of parameters that partition key is based on them (i.e. we cannot set them har-codded in the SELECT statement), otherwise the routingKey will be null.

How to work with temporary tables in foreachBatch?

We are building a streaming platform where it is essential to work with SQL's in batches.
val query = streamingDataSet.writeStream.option("checkpointLocation", checkPointLocation).foreachBatch { (df, batchId) => {
df.createOrReplaceTempView("events")
val df1 = ExecutionContext.getSparkSession.sql("select * from events")
df1.limit(5).show()
// More complex processing on dataframes
}}.trigger(trigger).outputMode(outputMode).start()
query.awaitTermination()
Error thrown is :
org.apache.spark.sql.streaming.StreamingQueryException: Table or view not found: events
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'events' not found in database 'default';
Streaming source is Kafka with watermarking and without using Spark-SQL we are able to execute dataframe transformations. Spark version is 2.4.0 and Scala is 2.11.7. Trigger is ProcessingTime every 1 minute and OutputMode is Append.
Is there any other approach to facilitate use of spark-sql within foreachBatch ? Would it work with upgraded version of Spark - in which case to version do we upgrade ?
Kindly help. Thank you.
tl;dr Replace ExecutionContext.getSparkSession with df.sparkSession.
The reason of the StreamingQueryException is that the streaming query tries to access the events temporary table in a SparkSession that knows nothing about it, i.e. ExecutionContext.getSparkSession.
The only SparkSession that has this events temporary table registered is exactly the SparkSession the df dataframe is created within, i.e. df.sparkSession.
Please check the code snippet below. Here, I have created two separate DataFrames, responseDF1 and responseDF2 from resultDF and shown the output in the console. responseDF2 is created using a temporary table. You can try the same.
resultDF.writeStream.foreachBatch {(batchDF: DataFrame, batchId: Long) =>
batchDF.persist()
val responseDF1 = batchDF.selectExpr("ResponseObj.type","ResponseObj.key", "ResponseObj.activity", "ResponseObj.price")
responseDF1.show()
responseDF1.createTempView("responseTbl1")
val responseDF2 = batchDF.sparkSession.sql("select activity, key from responseTbl1")
responseDF2.show()
batchDF.sparkSession.catalog.dropTempView("responseTbl1")
batchDF.unpersist()
()}.start().awaitTermination()
Code Snippet

Saving / exporting transformed DataFrame back to JDBC / MySQL

I'm trying to figure out how to use the new DataFrameWriter to write data back to a JDBC database. I can't seem to find any documentation for this, although looking at the source code it seems like it should be possible.
A trivial example of what I'm trying looks like this:
sqlContext.read.format("jdbc").options(Map(
"url" -> "jdbc:mysql://localhost/foo", "dbtable" -> "foo.bar")
).select("some_column", "another_column")
.write.format("jdbc").options(Map(
"url" -> "jdbc:mysql://localhost/foo", "dbtable" -> "foo.bar2")
).save("foo.bar2")
This doesn't work — I end up with this error:
java.lang.RuntimeException: org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
I'm not sure if I'm doing something wrong (why is it resolving to DefaultSource instead of JDBCRDD for example?) or if writing to an existing MySQL database just isn't possible using Spark's DataFrames API.
Update
Current Spark version (2.0 or later) supports table creation on write.
The original answer
It is possible to write to an existing table but it looks like at this moment (Spark 1.5.0) creating table using JDBC data source is not supported yet*. You can check SPARK-7646 for reference.
If table already exists you can simply use DataFrameWriter.jdbc method:
val prop: java.util.Properties = ???
df.write.jdbc("jdbc:mysql://localhost/foo", "foo.bar2", prop)
* What is interesting PySpark seems to support table creation using jdbc method.

Resources