Cassandra UUID in SELECT statements with Spark SQL - apache-spark

I have the following table in Cassandra (v. 2.2.3):
cqlsh> DESCRIBE TABLE historian.timelines;
CREATE TABLE historian.timelines (
assetid uuid,
tslice int,
...
value map<text, text>,
PRIMARY KEY ((assetid, tslice), ...)
) WITH CLUSTERING ORDER BY (deviceid ASC, paramid ASC, fts DESC)
...
;
And I want to extract the data through Apache Spark (v. 1.5.0) via the following Java snippet (using the Cassandra Spark Connector v. 1.5.0 and Cassandra driver core v. 2.2.0 RC3):
// Initialize Spark SQL Context
CassandraSQLContext sqlContext = new CassandraSQLContext(jsc.sc());
sqlContext.setKeyspace(keyspace);
DataFrame df = sqlContext.sql("SELECT * FROM " + tableName +
" WHERE assetid = '085eb9c6-8a16-11e5-af63-feff819cdc9f' LIMIT 2");
df.show();
At this point I get the following error when calling the show method above:
cannot resolve '(assetid = cast(085eb9c6-8a16-11e5-af63-feff819cdc9f as double))' due to data type mismatch:
differing types in '(assetid = cast(085eb9c6-8a16-11e5-af63-feff819cdc9f as double))' (uuid and double).;
So it seems that Spark SQL is not interpreting the assetid input as a UUID. What can I do to handle the Cassandra UUID type in Spark SQL queries?
Thanks!

Indeed, your query parameter is a String, not a UUID; simply convert the query parameter like this:
import java.util.UUID;
DataFrame df = sqlContext.sql("SELECT * FROM " + tableName +
" WHERE assetid = "+ UUID.fromString("085eb9c6-8a16-11e5-af63-feff819cdc9f") +" LIMIT 2");

Related

Spark SQL - Custom Datatype UUID

I am trying to convert a column in a Dataset from varchar to UUID using a custom datatype in Spark SQL, but I see the conversion not happening. Please let me know if I am missing anything here.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder
import spark.implicits._  // assuming a SparkSession named spark (e.g. in spark-shell)
val secdf = sc.parallelize( Array(("85d8b889-c793-4f23-93e9-ea18db640039","Revenue"), ("85d8b889-c793-4f23-93e9-ea18db640038","Income:123213"))).toDF("id", "report")
val metadataBuilder = new MetadataBuilder()
metadataBuilder.putString("database.column.type", "uuid")
metadataBuilder.putLong("jdbc.type", java.sql.Types.OTHER)
val metadata = metadataBuilder.build()
val secReportDF = secdf.withColumn("id", col("id").as("id", metadata))
I did a workaround, since we are not able to cast to UUID in Spark SQL: I added the property stringtype=unspecified to the Postgres JDBC connection, which solved my issue with inserting UUIDs through Spark JDBC.
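A minimal sketch of that workaround in Scala, assuming a hypothetical PostgreSQL database and target table reports with a uuid column id (URL, credentials, and table name are placeholders):
import java.util.Properties

val props = new Properties()
props.setProperty("user", "postgres")
props.setProperty("password", "secret")
// Let the Postgres driver send string parameters untyped, so the textual
// "id" values can be bound to a uuid column without an explicit cast.
props.setProperty("stringtype", "unspecified")

secReportDF.write
  .mode("append")
  .jdbc("jdbc:postgresql://localhost:5432/mydb", "reports", props)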

Partition key predicate must include all partition key columns

I'm trying to create a query in Spark using Scala; the data is available as a table in a Cassandra database. In the Cassandra table I have two kinds of keys:
1) Primary Key
2) Partition Key
Cassandra DDL will be something like this:
CREATE TABLE A.B (
id1 text,
id2 text,
timing timestamp,
value float,
PRIMARY KEY ((id1, id2), timing)
) WITH CLUSTERING ORDER BY (timing DESC)
My Spark program:
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
val conf = new SparkConf(true).set("spark.cassandra.connection.host","192.168.xx.xxx").set("spark.cassandra.auth.username","test").set("spark.cassandra.auth.password","test")
val sc = new SparkContext(conf)
var ctable = sc.cassandraTable("A", "B").select("id1","id2","timing","value").where("id1=?","1001")
When I query with a predicate on "value" I obtain a result, but when I query on id1 or id2 I receive an error.
Error Obtained:
java.lang.UnsupportedOperationException: Partition key predicate must include all partition key columns or partition key columns need to be indexed. Missing columns: id2
I'm using spark-2.2.0-bin-hadoop2.7, Cassandra 3.9, and Scala 2.11.8.
Thanks in advance.
The output I required was obtained by using the following program.
val conf = new SparkConf(true).set("spark.cassandra.connection.host","192.168.xx.xxx").set("spark.cassandra.auth.username","test").set("spark.cassandra.auth.password","test")
val sc = new SparkContext(conf)
var ctable = sc.cassandraTable("A", "B").select("id1","id2","timing","value").where("id1=?","1001").where("id2=?","1002")
This is how we can filter on the partition key of a Cassandra table through Spark: every partition key column must be restricted.
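Equivalently (a minimal sketch, assuming the same keyspace "A", table "B", and key values as above), the connector also accepts both restrictions in a single where clause:
import com.datastax.spark.connector._

val ctable = sc.cassandraTable("A", "B")
  .select("id1", "id2", "timing", "value")
  .where("id1 = ? AND id2 = ?", "1001", "1002")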

Spark query appends keyspace to the spark temp table

I have a cassandraSQLContext where I do this:
cassandraSqlContext.setKeyspace("test");
Because if I don't, it complains about the default keyspace not being set.
Now when I run this piece of code:
def insertIntoCassandra(siteMetaData: MetaData, dataFrame: DataFrame): Unit = {
  System.out.println(dataFrame.show())
  val tableName = siteMetaData.getTableName.toLowerCase()
  dataFrame.registerTempTable("spark_" + tableName)
  System.out.println("Registered the spark table to spark_" + tableName)
  val columns = columnMap.get(siteMetaData.getTableName)
  val query = cassandraQueryBuilder.buildInsertQuery("test", tableName, columns)
  System.out.println("Query: " + query)
  cassandraSqlContext.sql(query)
  System.out.println("Query executed")
}
It gives me this error log:
Registered the spark table to spark_test
Query: INSERT INTO TABLE test.tablename SELECT **the columns here** FROM spark_tablename
17/02/28 04:15:53 ERROR JobScheduler: Error running job streaming job 1488255351000 ms.0
java.util.concurrent.ExecutionException: java.io.IOException: Couldn't find test.tablename or any similarly named keyspace and table pairs
What I don't understand is why cassandraSQLContext isn't executing the printed-out query, and why it appends the keyspace to the Spark temp table.
public String buildInsertQuery(String activeReplicaKeySpace, String tableName, String columns) {
  String sql = "INSERT INTO TABLE " + activeReplicaKeySpace + "." + tableName +
      " SELECT " + columns + " FROM spark_" + tableName;
  return sql;
}
So the problem was that I was using two different instances of cassandraSQLContext. In one of the methods, I was instantiating a new cassandraSQLContext, which conflicted with the cassandraSQLContext being passed to the insertIntoCassandra method.

How to stop loading the whole table in Spark?

The thing is, I have read rights to one table, which is partitioned by year, month, and day, but I don't have the right to read the data from 2016/04/24.
When I execute this command in Hive:
hive>select * from table where year="2016" and month="06" and day="01";
I can read every other day's data except 2016/04/24.
But when I read it in Spark:
sqlContext.sql("select * from table where year='2016' and month='06' and day='01'")
an exception is thrown saying that I don't have the right to read hdfs/.../2016/04/24.
Does this show that Spark SQL loads the whole table first and then filters?
How can I avoid loading the whole table?
You can use JdbcRDD directly. With it you bypass the Spark SQL engine, so your queries are sent directly to Hive.
To use JdbcRDD you need to load the Hive JDBC driver and register it first (if it is not registered already).
val driver = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driver)
Then you can create a JdbcRDD:
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val connUrl = "jdbc:hive2://..."
val query = """select * from table where year="2016" and month="06" and day="01" and ? = ?"""
val lowerBound = 0
val upperBound = 0
val numOfPartitions = 1
new JdbcRDD(
  sc,
  () => DriverManager.getConnection(connUrl),
  query,
  lowerBound,
  upperBound,
  numOfPartitions,
  (r: ResultSet) => r.getString(1) // get the data here, or map the row with a function
)
The JdbcRDD query must contain two ? placeholders so that your data can be partitioned between the lower and upper bounds, so you should write a better query than mine; this one just creates a single partition to demonstrate how it works.
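As a sketch of what a properly partitioned query could look like, assuming the table had a numeric column to split on (the hour-of-day column hr below is hypothetical), JdbcRDD binds the two ? to sub-ranges of [lowerBound, upperBound], one per partition:
// hr is a hypothetical numeric column; Spark substitutes the two ? with
// sub-ranges of [0, 23] and runs one query per partition.
val partitionedQuery =
  """select * from table
    |where year="2016" and month="06" and day="01"
    |and hr >= ? and hr <= ?""".stripMargin

val partitionedRdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(connUrl),
  partitionedQuery,
  0,   // lowerBound
  23,  // upperBound
  4,   // numPartitions
  (r: ResultSet) => r.getString(1)
)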
However, before doing this I recommend you check HiveContext, which supports HiveQL as well.
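A minimal sketch of that route, assuming Spark 1.x with an existing SparkContext sc; since the predicates are on the partition columns, Hive partition pruning should keep Spark from listing or reading the forbidden 2016/04/24 directory:
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Predicates on the partition columns, so Hive can prune to the matching partitions.
val df = hiveContext.sql("select * from table where year='2016' and month='06' and day='01'")
df.show()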

How to use saveToCassandra()

I am new to Spark. I want to save my Spark data to Cassandra: I have an RDD and I want to save the data of this RDD into more than one table in Cassandra. Is this possible? If yes, then how?
Use the Spark-Cassandra Connector.
How to save data to Cassandra, with an example from the docs:
import com.datastax.spark.connector._
val collection = sc.parallelize(Seq(("cat", 30), ("fox", 40)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
See the project and full documentation here: https://github.com/datastax/spark-cassandra-connector
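For the more-than-one-table part of the question: the same RDD can simply be saved once per target table. A minimal sketch, assuming a second table test.words_copy with the same word and count columns (the table name is hypothetical); if the RDD is expensive to compute, cache it first so it isn't recomputed for the second write:
// Cache once, then write the same RDD into two different Cassandra tables.
collection.cache()
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
collection.saveToCassandra("test", "words_copy", SomeColumns("word", "count"))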

Python pyspark Cassandra saveToCassandra Spark

Imagine your table is the following:
CREATE TABLE ks.test (
id uuid,
sampleId text,
validated boolean,
cell text,
gene text,
state varchar,
data bigint,
PRIMARY KEY (id, sampleId)
);
How can you update only the 'validated' field for a given sampleId in the test table in the keyspace ks? You can use the following lines to update the table in Python.
from pyspark import SparkConf
import pyspark_cassandra
from pyspark_cassandra import CassandraSparkContext

# <IP1> is the Cassandra contact point, <IP2> the native transport port.
conf = SparkConf().set("spark.cassandra.connection.host", <IP1>).set("spark.cassandra.connection.native.port", <IP2>)
sparkContext = CassandraSparkContext(conf = conf)

# One row keyed by (id, sampleId); only the listed columns will be written.
rdd = sparkContext.parallelize([{"validated": False, "sampleId": "323112121", "id": "121224235-11e5-9023-23789786ess"}])
rdd.saveToCassandra("ks", "test", {"validated", "sampleId", "id"})
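A note on why this updates only the 'validated' cell (my reading of the Cassandra write path): writes in Cassandra are upserts, so saving just the id, sampleId, and validated columns for an existing primary key overwrites that one cell and leaves the remaining columns untouched.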
