Is there any way to find out which node has been used by a SELECT statement in Cassandra?

I have written a custom LoadBalancerPolicy for the spark-cassandra-connector, and now I want to make sure it really works!
I have a Cassandra cluster with 3 nodes and a keyspace with a replication factor of 2, so for any given record there are only two nodes that hold the data.
The thing is that I want to ensure the spark-cassandra-connector (with my load-balancer policy) is still token-aware and will choose the right node as coordinator for each SELECT statement.
Now I'm wondering whether we can write a trigger on SELECT for each node: if a node that does not hold the data receives the query, the trigger creates a log entry and I know the load-balancer policy is not working properly. How can we write a trigger on SELECT in Cassandra? Is there a better way to accomplish this?
I already checked the documentation for creating triggers, but it is quite limited:
Official documentation
Documentation at DataStax
Example implementation in official repo

You can do it from the program side: get the routing key for your bound statement (you must use prepared statements), find the replicas for it via the Metadata class, and then check whether the host in the ExecutionInfo that you get from the ResultSet is one of those replicas.

According to what Alex said, we can do it as below:
After creating the SparkSession, we need to create a connector:
import com.datastax.spark.connector.cql.CassandraConnector
val connector = CassandraConnector.apply(sparkSession.sparkContext.getConf)
Now we can define a prepared statement and do the rest:
connector.withSessionDo(session => {
  val selectQuery = "select * from test where id=?"
  val preparedStatement = session.prepare(selectQuery)
  val protocolVersion = session.getCluster.getConfiguration.getProtocolOptions.getProtocolVersion
  // We have to explicitly bind all of the parameters the partition key is based on,
  // otherwise the routing key will be null.
  val boundStatement = preparedStatement.bind(s"$id")
  val routingKey = boundStatement.getRoutingKey(protocolVersion, null)
  // Get all of the replica nodes that hold the row
  val replicas = session.getCluster.getMetadata.getReplicas("test", routingKey)
  val resultSet = session.execute(boundStatement)
  // Get the node that actually served the query (the coordinator)
  val host = resultSet.getExecutionInfo.getQueriedHost
  // Final step: check whether the replicas contain that host
  if (replicas.contains(host)) println("It works!")
})
The important thing is that we have to explicitly bind all of the parameters the partition key is based on (i.e. we cannot hard-code them in the SELECT statement), otherwise the routingKey will be null.
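As an illustration of that note, here is a minimal sketch (assuming the same session, test table, and id value as above) of the hard-coded variant it warns against; since there are no bind markers for the partition key, the driver cannot compute a routing key:
// The partition key value is baked into the query text, so there is nothing
// for the driver to derive a routing key from.
val hardCodedStatement = session.prepare(s"select * from test where id='$id'").bind()
val nullRoutingKey = hardCodedStatement.getRoutingKey(protocolVersion, null) // returns null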

Related

Spark DataFrame Filter using Binary (Array[Bytes]) data

I have a DataFrame from a JDBC table hitting MySQL, and I need to filter it using a UUID. The data is stored in MySQL using binary(16), and when queried from Spark it is converted to Array[Byte] as expected.
I'm new to spark and have been trying various ways to pass a variable of type UUID into the DataFrame's filter method.
I've tried statements like:
val id: UUID = // other logic that looks this up
df.filter(s"id = $id")
df.filter("id = " convertToByteArray(id))
df.filter("id = " convertToHexString(id))
All of these error with different messages.
I just need to somehow pass in Binary types but can't seem to put my finger on how to do so properly.
Any help is greatly appreciated.
After reviewing even more sources online, I found a way to accomplish this without using the filter method.
When reading with my sparkSession, I just use an ad-hoc subquery instead of the table name, as follows:
sparkSession.read.jdbc(connectionString, s"(SELECT id, {other cols omitted} FROM MyTable WHERE id = 0x$id) AS MyTable", props)
This pre-filters the results for me and then I just work with the data frame as I need.
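For context, a slightly fuller sketch of that read is below; the connection string, credentials, and extra column are placeholders, and it assumes the UUID was stored as its raw 16 bytes, so the 0x literal is just the UUID's hex digits without dashes:
import java.util.Properties
import java.util.UUID

val connectionString = "jdbc:mysql://localhost:3306/mydb" // placeholder URL
val props = new Properties()
props.setProperty("user", "dbUser")       // placeholder credentials
props.setProperty("password", "dbPass")

val id: UUID = UUID.randomUUID()          // stand-in for the looked-up id
val hexId = id.toString.replace("-", "")  // hex digits only, no dashes

val df = sparkSession.read.jdbc(
  connectionString,
  s"(SELECT id, name FROM MyTable WHERE id = 0x$hexId) AS MyTable", // columns assumed
  props)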
If anyone knows of a solution using filter, I'd still love to know it as that would be useful in some cases.

Cassandra Statement set KeySpace

Using Cassandra 2.2.8 with the 3.0 connector.
I am trying to create a Statement with QueryBuilder. When I execute the Statement it complains that no keyspace is defined. The only way I know to set the keyspace is as below (there is no setKeyspace method on Statement). When I call getKeyspace, I actually get null:
Statement s = QueryBuilder.select().all()
    .from("test.tests");
System.out.println("getKeyspace:" + s.getKeyspace()); // >> null
Am I doing something wrong? Is there any other (more reliable) way to set the keyspace?
Thanks
from(String) expects a table name. While what you are doing is technically valid and Cassandra will interpret it correctly, the driver is not able to derive the keyspace name this way.
Instead you could use from(String, String) which takes the first parameter as the keyspace.
Statement s = QueryBuilder.select().all()
    .from("test", "tests");
System.out.println("getKeyspace:" + s.getKeyspace()); // >> test

Query Spark SQL from Node.js server

I'm currently using npm's cassandra-driver to query my Cassandra database from a Node.js server. Since I want to be able to write more complex queries, I'd like to use Spark SQL instead of CQL. Is there any way to create a RESTful API (or something else) so that I can use Spark SQL the same way that I currently use CQL?
In other words, I want to be able to send a Spark SQL query from my Node.js server to another server and get a result back.
Is there any way to do this? I've been searching for solutions to this problem for a while and haven't found anything yet.
Edit: I'm able to query my database with Scala and Spark SQL from the Spark shell, so that bit is working. I just need to connect Spark and my Node.js server somehow.
I had a similar problem, and I solved it by using Spark JobServer.
The main approach with Spark JobServer (SJS) is usually to create a special job that extends its SparkSQLJob, as in the following example:
object ExecuteQuery extends SparkSQLJob {
  override def validate(sqlContext: SQLContext, config: Config): SparkJobValidation = {
    // Code to validate the parameters received in the request body
  }

  override def runJob(sqlContext: SQLContext, jobConfig: Config): Any = {
    // Assuming your request sent a { "query": "..." } in the body:
    val df = sqlContext.sql(jobConfig.getString("query"))
    createResponseFromDataFrame(df) // You should implement this
  }
}
However, for this approach to work well with Cassandra, you have to use the spark-cassandra-connector, and then you have two options for loading the data:
1) Before calling ExecuteQuery via REST, you have to transfer from Cassandra to Spark the full data you want to query. For that, you would do something like this (code adapted from the spark-cassandra-connector documentation):
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test"))
  .load()
Then register it as a table so that Spark SQL can access it:
df.registerTempTable("myTable") // As a temporary table
df.write.saveAsTable("myTable") // As a persistent Hive Table
Only after that would you be able to use ExecuteQuery to query myTable.
2) As the first option can be inefficient in some use cases, there is another option.
The spark-cassandra-connector has a special CassandraSQLContext that can be used to query C* tables directly from Spark. It can be used like this:
val cc = new CassandraSQLContext(sc)
val df = cc.sql("SELECT * FROM keyspace.table ...")
However, to use a different type of context with Spark JobServer, you need to extend SparkContextFactory and use it when the context is created (which can be done by a POST request to /contexts). An example of a special context factory can be seen in the SJS GitHub repo. You also have to create a SparkCassandraJob extending SparkJob (but this part is very easy).
Finally, the ExecuteQuery job has to be adapted to use the new classes. It would look something like this:
object ExecuteQuery extends SparkCassandraJob {
  override def validate(cc: CassandraSQLContext, config: Config): SparkJobValidation = {
    // Code to validate the parameters received in the request body
  }

  override def runJob(cc: CassandraSQLContext, jobConfig: Config): Any = {
    // Assuming your request sent a { "query": "..." } in the body:
    val df = cc.sql(jobConfig.getString("query"))
    createResponseFromDataFrame(df) // You should implement this
  }
}
After that, the ExecuteQuery job can be executed via REST with a POST request.
Conclusion
Here I use the first option because I need the advanced queries available in the HiveContext (window functions, for example), which are not available in the CassandraSQLContext. However, if you don't need those kinds of operations, I recommend the second approach, even though it needs some extra coding to create a new ContextFactory for SJS.

Cassandra Prepared Statement - Binding Parameters Twice

I have a CQL query I want to perform. The CQL string looks like this:
SELECT * FROM :columnFamilyName WHERE <some_column_name> = :name AND <some_id> = :id;
My application has two layers of abstraction above the datastax driver. In one layer I want to bind the first two parameters and in another layer I'd like to bind the last parameter.
The problem is, if I bind the first two parameters, I get a BoundStatement to which I cannot bind another parameter. Am I missing something? Can it be done?
We're using datastax driver version 2.0.3.
Thanks,
Anatoly.
You should be able to bind any number of parameters to your BoundStatement using boundStatement.setXXXX(index, value), as follows:
BoundStatement statement = new BoundStatement(query);
statement.setString(0, "value");
statement.setInt(1, 1);
statement.setDate(2, new Date());
ResultSet results = session.execute(statement);
The problem, though, is that you're trying to use a dynamic column family whose name changes with the value you want to bind.
As far as I know, this is not allowed, so you should instead prepare one statement per table and then use the right bound statement.
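To illustrate the layered binding itself with a fixed table, here is a minimal Scala sketch against the 2.0.x driver (the table my_table and its name/id columns are assumptions); the same BoundStatement can be handed from one layer to the next and have more parameters set before execution:
val prepared = session.prepare("SELECT * FROM my_table WHERE name = ? AND id = ?")
val bound = prepared.bind()        // layer 1: create the bound statement
bound.setString(0, "someName")     // layer 1: bind the first parameter
// ... the statement is passed on to another layer ...
bound.setInt(1, 42)                // layer 2: bind the remaining parameter
val results = session.execute(bound)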

How to check if a Cassandra table exists

Is there an easy way to check whether a table (column family) is defined in Cassandra using CQL (or perhaps the API, using com.datastax.driver)?
Right now I am leaning towards executing SELECT 1 FROM table and checking for an exception, but maybe there is a better way?
As of 1.1 you should be able to query the system keyspace's schema_columnfamilies column family. If you know which keyspace you want to check, this CQL will list all column families in that keyspace:
SELECT columnfamily_name
FROM schema_columnfamilies WHERE keyspace_name='myKeyspaceName';
The ticket describing this functionality is here: https://issues.apache.org/jira/browse/CASSANDRA-2477
Note, though, that some of the system column names changed between 1.1 and 1.2, so you might have to adjust the query a little to get your desired results.
Edit 20160523 - Cassandra 3.x Update:
Note that for Cassandra 3.0 and up, you'll need to make a few adjustments to the above query:
SELECT table_name
FROM system_schema.tables WHERE keyspace_name='myKeyspaceName';
The Java driver (since you mentioned it in your question) also maintains a local representation of the schema.
Driver 3.x and below:
KeyspaceMetadata ks = cluster.getMetadata().getKeyspace("myKeyspace");
TableMetadata table = ks.getTable("myTable");
boolean tableExists = (table != null);
Driver 4.x and above:
Metadata metadata = session.getMetadata();
boolean tableExists =
    metadata.getKeyspace("myKeyspace")
        .flatMap(ks -> ks.getTable("myTable"))
        .isPresent();
I just needed to manually check for the existence of a table using cqlsh.
Possibly useful general info.
describe keyspace_name.table_name
If it doesn't exist, you'll get 'table_name' not found in keyspace 'keyspace'.
If it does exist you'll get a description of the table.
For the .NET driver CassandraCSharpDriver version 3.17.1 the following code creates a table if it doesn't exist yet:
var ks = _cassandraSession.Cluster.Metadata.GetKeyspace(keyspaceName);
var tableNames = ks.GetTablesNames();
if (!tableNames.Contains(tableName.ToLowerInvariant()))
{
    var stmt = new SimpleStatement($"CREATE TABLE {tableName} (id text PRIMARY KEY, name text, price decimal, volume int, time timestamp)");
    _cassandraSession.Execute(stmt);
}
You will need to adapt the list of table columns to your needs. The statement can also be awaited by using await _cassandraSession.ExecuteAsync(stmt).ConfigureAwait(false) in an async method.
Also, I want to mention that I'm using Cassandra version 4.0.1.
