I'm trying to create a query in Spark using Scala; the data is available as a table in a Cassandra database. The Cassandra table has two kinds of keys: 1) Primary Key
2) Partition Key
The Cassandra DDL is something like this:
CREATE TABLE A.B (
id1 text,
id2 text,
timing timestamp,
value float,
PRIMARY KEY ((id1, id2), timing)
) WITH CLUSTERING ORDER BY (timing DESC)
My Spark program:
import com.datastax.spark.connector._

val conf = new SparkConf(true).set("spark.cassandra.connection.host", "192.168.xx.xxx").set("spark.cassandra.auth.username", "test").set("spark.cassandra.auth.password", "test")
val sc = new SparkContext(conf)
val ctable = sc.cassandraTable("A", "B").select("id1", "id2", "timing", "value").where("id1=?", "1001")
When I filter on "value" I obtain results, but when I filter on id1 or id2 I receive an error.
Error obtained:
java.lang.UnsupportedOperationException: Partition key predicate must include all partition key columns or partition key columns need to be indexed. Missing columns: id2
I'm using spark-2.2.0-bin-hadoop2.7, Cassandra 3.9 and Scala 2.11.8.
Thanks in advance.
The output I required was obtained with the following program:
import com.datastax.spark.connector._

val conf = new SparkConf(true).set("spark.cassandra.connection.host", "192.168.xx.xxx").set("spark.cassandra.auth.username", "test").set("spark.cassandra.auth.password", "test")
val sc = new SparkContext(conf)
val ctable = sc.cassandraTable("A", "B").select("id1", "id2", "timing", "value").where("id1=?", "1001").where("id2=?", "1002")
This is how we can query a table with a composite partition key in Cassandra through Spark: the where clause has to cover every partition key column (here both id1 and id2) so that the predicate can be pushed down to Cassandra.
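For completeness, if only one of the partition key columns is known (say id1 alone), the predicate cannot be pushed down and the connector raises the error above. A Spark-side filter is a possible fallback, at the cost of scanning the whole table; a minimal sketch using the same table as above:

val byId1Only = sc.cassandraTable("A", "B")
  .select("id1", "id2", "timing", "value")
  .filter(row => row.getString("id1") == "1001")  // filtered in Spark, not in Cassandra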
Related
I am trying to merge incremental data into an existing Hive table.
For testing, I created a dummy table from the base table as below:
create table base.dummytable like base.fact_table
The table base.fact_table is partitioned on dbsource (string).
When I checked the dummy table's DDL, I could see that the partition column is correctly defined.
PARTITIONED BY (
  `dbsource` string)
Then I tried to exchange one of the partitions of the dummy table by dropping it first.
spark.sql("alter table base.dummy drop partition(dbsource='NEO4J')")
The partition NEO4J was dropped successfully, and I ran the exchange statement as below:
spark.sql("ALTER TABLE base.dummy EXCHANGE PARTITION (dbsource = 'NEO4J') WITH TABLE stg.inc_labels_neo4jdata")
The exchange statement is giving an error:
Error: Error while compiling statement: FAILED: ValidationFailureSemanticException table is not partitioned but partition spec exists: {dbsource=NEO4J}
The table I am trying to push the incremental data into is partitioned by dbsource, and I dropped its NEO4J partition successfully.
I am running this from spark code and the config is given below:
val conf = new SparkConf().setAppName("MERGER").set("spark.executor.heartbeatInterval", "120s")
.set("spark.network.timeout", "12000s")
.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
.set("spark.shuffle.compress", "true")
.set("spark.shuffle.spill.compress", "true")
.set("spark.sql.orc.filterPushdown", "true")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryoserializer.buffer.max", "512m")
.set("spark.serializer", classOf[org.apache.spark.serializer.KryoSerializer].getName)
.set("spark.streaming.stopGracefullyOnShutdown", "true")
.set("spark.dynamicAllocation.enabled", "false")
.set("spark.shuffle.service.enabled", "true")
.set("spark.executor.instances", "4")
.set("spark.executor.memory", "4g")
.set("spark.executor.cores", "5")
.set("hive.merge.sparkfiles","true")
.set("hive.merge.mapfiles","true")
.set("hive.merge.mapredfiles","true")
show create table base.dummy:
CREATE TABLE `base`.`dummy`(
`dff_id` bigint,
`dff_context_id` bigint,
`descriptive_flexfield_name` string,
`model_table_name` string)
PARTITIONED BY (`dbsource` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'/apps/hive/warehouse/base.db/dummy'
TBLPROPERTIES (
'orc.compress'='ZLIB')
show create table stg.inc_labels_neo4jdata:
CREATE TABLE `stg`.`inc_labels_neo4jdata`(
`dff_id` bigint,
`dff_context_id` bigint,
`descriptive_flexfield_name` string,
`model_table_name` string,
`dbsource` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'/apps/hive/warehouse/stg.db/inc_labels_neo4jdata'
TBLPROPERTIES (
'orc.compress'='ZLIB')
Could anyone let me know what mistake I am making here, and what I should change in order to successfully exchange the partition?
My take on this error is that the table stg.inc_labels_neo4jdata is not partitioned like base.dummy, and therefore there is no partition to move.
From Hive documentation:
This statement lets you move the data in a partition from a table to
another table that has the same schema and does not already have that
partition.
You can check the Hive DDL Manual for EXCHANGE PARTITION
And the JIRA where this feature was added to Hive, where you can read:
This only works if the source table and the destination table have the same
field schemas and the same partition by parameters. If they do not,
the command will throw an exception.
You basically need exactly the same schema, including the PARTITIONED BY clause, on both source_table and destination_table.
Per your last edit, this is not the case: in stg.inc_labels_neo4jdata, dbsource is a regular column, so the table is not partitioned at all.
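One way out, sketched below, is to recreate the staging table with dbsource as a partition column, reload it, and only then run the exchange. This is only a sketch: the table name inc_labels_neo4jdata_part is an illustrative choice, the column list is copied from the DDL above, and the dynamic-partition insert may need hive.exec.dynamic.partition.mode=nonstrict.

// Recreate the staging table with the same layout as base.dummy,
// i.e. dbsource as a partition column instead of a regular one.
spark.sql("""
  CREATE TABLE stg.inc_labels_neo4jdata_part (
    dff_id bigint,
    dff_context_id bigint,
    descriptive_flexfield_name string,
    model_table_name string)
  PARTITIONED BY (dbsource string)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress'='ZLIB')
""")

// Copy the existing staging data into the partitioned layout.
spark.sql("""
  INSERT OVERWRITE TABLE stg.inc_labels_neo4jdata_part PARTITION (dbsource)
  SELECT dff_id, dff_context_id, descriptive_flexfield_name, model_table_name, dbsource
  FROM stg.inc_labels_neo4jdata
""")

// Both sides are now partitioned the same way, so the exchange can succeed.
spark.sql("ALTER TABLE base.dummy EXCHANGE PARTITION (dbsource = 'NEO4J') WITH TABLE stg.inc_labels_neo4jdata_part")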
I want to migrate our old Cassandra cluster to a new one.
Requirements:
I have a Cassandra cluster of 10 nodes and the table I want to migrate is ~100GB. I am using Spark for migrating the data. My Spark cluster has 10 nodes and each node has around 16GB of memory.
The table contains some junk data which I don't want to migrate to the new table. For example, let's say I don't want to transfer the rows that have cid = 1234. What is the best way to migrate this using a Spark job? I can't put a where filter on the cassandraRdd directly, as cid is not the only column in the partition key.
Cassandra table:
test_table (
cid text,
uid text,
key text,
value map<text, timestamp>,
PRIMARY KEY ((cid, uid), key)
)
Sample data:
cid | uid | key | value
------+--------------------+-----------+-------------------------------------------------------------------------
1234 | 899800070709709707 | testkey1 | {'8888': '2017-10-22 03:26:09+0000'}
6543 | 097079707970709770 | testkey2 | {'9999': '2017-10-20 11:08:45+0000', '1111': '2017-10-20 15:31:46+0000'}
I am thinking of something like the code below, but I guess this is not the most efficient approach.
val filteredRdd = rdd.filter { row => row.getString("cid") != "1234" }
filteredRdd.saveToCassandra(KEYSPACE_NAME,NEW_TABLE_NAME)
What would be the best possible approach here?
That method is pretty good. You may want to write it with DataFrames to take advantage of the row encoding, but this may only bring a slight benefit. The key bottleneck in this operation will be writing to and reading from Cassandra.
DF Example
spark
.read
.format("org.apache.spark.sql.cassandra")
.option("keyspace", ks)
.option("table", table)
.load
.filter( 'cid !== "1234" )   // keep everything except the junk partition key value
.write
.format("org.apache.spark.sql.cassandra")
.option("keyspace", ks2)
.option("table", table2)
.save
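Since the data is moving from the old cluster to a new one, the read and the write also need to point at different contact hosts. The sketch below follows the connector's documented multi-cluster pattern for the RDD API; the host names and TABLE_NAME are placeholders, while KEYSPACE_NAME and NEW_TABLE_NAME are taken from the snippet in the question.

import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

// One connector per cluster; the hosts below are placeholders.
val oldCluster = CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", "old-cluster-host"))
val newCluster = CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", "new-cluster-host"))

// Read from the old cluster and drop the junk rows.
val filteredRdd = {
  implicit val c = oldCluster
  sc.cassandraTable(KEYSPACE_NAME, TABLE_NAME)
    .filter(row => row.getString("cid") != "1234")
}

// Write the remaining rows to the new cluster.
{
  implicit val c = newCluster
  filteredRdd.saveToCassandra(KEYSPACE_NAME, NEW_TABLE_NAME)
}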
Currently I'm trying to filter a Hive table by the latest date_processed.
The table is partitioned by:
System
date_processed
Region
The only way I've managed to filter it, is by doing a join query:
query = "select * from contracts_table as a join (select (max(date_processed) as maximum from contract_table as b) on a.date_processed = b.maximum"
This approach is really time consuming, as I have to do the same procedure for 25 tables.
Does anyone know a way to read the latest loaded partition of a table directly in Spark < 1.6?
This is the method I'm using to read:
public static DataFrame loadAndFilter(String query)
{
    return SparkContextSingleton.getHiveContext().sql(query);
}
Many thanks!
A DataFrame with all of the table's partitions can be obtained with:
val partitionsDF = hiveContext.sql("show partitions TABLE_NAME")
The values can then be parsed to get the maximum.
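A minimal sketch of that parsing step, assuming the partitions come back in the usual key=value form (system=.../date_processed=.../region=...) and using the table name from the question as a placeholder:

// Each row of "show partitions" is a single string such as
// "system=SAP/date_processed=2017-10-20/region=EU".
val partitionsDF = hiveContext.sql("show partitions contracts_table")

val latestDate = partitionsDF
  .collect()
  .map(_.getString(0))
  .flatMap(_.split("/").find(_.startsWith("date_processed=")))
  .map(_.stripPrefix("date_processed="))
  .max   // lexicographic max works for yyyy-MM-dd style values

// Read only the latest partition; the predicate is applied as partition pruning.
val latestDF = hiveContext.sql(
  s"select * from contracts_table where date_processed = '$latestDate'")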
I have the following table in Cassandra (v. 2.2.3):
cqlsh> DESCRIBE TABLE historian.timelines;
CREATE TABLE historian.timelines (
assetid uuid,
tslice int,
...
value map<text, text>,
PRIMARY KEY ((assetid, tslice), ...)
) WITH CLUSTERING ORDER BY (deviceid ASC, paramid ASC, fts DESC)
...
;
I want to extract the data through Apache Spark (v. 1.5.0) via the following Java snippet (using the Cassandra Spark connector v. 1.5.0 and Cassandra driver core v. 2.2.0-RC3):
// Initialize Spark SQL Context
CassandraSQLContext sqlContext = new CassandraSQLContext(jsc.sc());
sqlContext.setKeyspace(keyspace);
DataFrame df = sqlContext.sql("SELECT * FROM " + tableName +
" WHERE assetid = '085eb9c6-8a16-11e5-af63-feff819cdc9f' LIMIT 2");
df.show();
At this point I get the following error when calling the show method above:
cannot resolve '(assetid = cast(085eb9c6-8a16-11e5-af63-feff819cdc9f as double))' due to data type mismatch:
differing types in '(assetid = cast(085eb9c6-8a16-11e5-af63-feff819cdc9f as double))' (uuid and double).;
So it seems that Spark SQL is not interpreting the assetid input as a UUID. What can I do to handle the Cassandra UUID type in Spark SQL queries?
Thanks!
Indeed, your query parameter is a String, not a UUID; simply convert the query parameter like this:
import java.util.UUID;
DataFrame df = sqlContext.sql("SELECT * FROM " + tableName +
" WHERE assetid = "+ UUID.fromString("085eb9c6-8a16-11e5-af63-feff819cdc9f") +" LIMIT 2");
I am new to Spark. I want to save my Spark data to Cassandra, with the condition that I have an RDD and I want to save the data of this RDD into more than one table in Cassandra. Is this possible, and if so, how?
Use the Spark-Cassandra Connector.
How to save data to Cassandra, an example from the docs:
val collection = sc.parallelize(Seq(("cat", 30), ("fox", 40)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
See the project and full documentation here: https://github.com/datastax/spark-cassandra-connector
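Since the question asks about writing the same RDD to more than one table, one straightforward way is simply to call saveToCassandra once per target table. A sketch, where the second table name is purely illustrative and both tables are assumed to have matching columns:

import com.datastax.spark.connector._

val collection = sc.parallelize(Seq(("cat", 30), ("fox", 40)))

// Cache so the RDD is not recomputed for the second write.
collection.cache()

// Write the same data to two different tables (the second name is hypothetical);
// each table just needs columns matching the ones listed in SomeColumns.
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
collection.saveToCassandra("test", "words_backup", SomeColumns("word", "count"))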
Python pyspark Cassandra saveToCassandra Spark
Imagine your table is the following:
CREATE TABLE ks.test (
id uuid,
sampleId text,
validated boolean,
cell text,
gene text,
state varchar,
data bigint,
PRIMARY KEY (id, sampleId)
);
How can you update only the 'validated' field for a given sampleId in the test table in the keyspace ks? You can use the following lines to update the table in Python.
from pyspark import SparkConf
import pyspark_cassandra
from pyspark_cassandra import CassandraSparkContext

# Point the connector at the Cassandra contact host and native port.
conf = SparkConf().set("spark.cassandra.connection.host", <IP1>).set("spark.cassandra.connection.native.port", <IP2>)
sparkContext = CassandraSparkContext(conf=conf)

# One dict per row: the full primary key plus the column to update.
rdd = sparkContext.parallelize([{"validated": False, "sampleId": "323112121", "id": "121224235-11e5-9023-23789786ess"}])

# Cassandra writes are upserts, so listing only these columns updates
# 'validated' for the given (id, sampleId) without touching other columns.
rdd.saveToCassandra("ks", "test", {"validated", "sampleId", "id"})