SELECT DISTINCT Cassandra in Spark - apache-spark

I need a query that lists out the unique composite partition keys inside of Spark.
The query in Cassandra, SELECT DISTINCT key1, key2, key3 FROM schema.table;, is quite fast; however, putting the same sort of filter in an RDD or spark.sql retrieves results incredibly slowly in comparison.
e.g.
---- SPARK ----
var t1 = sc.cassandraTable("schema","table").select("key1", "key2", "key3").distinct()
var t2 = spark.sql("SELECT DISTINCT key1, key2, key3 FROM schema.table")
t1.count // takes 20 minutes
t2.count // takes 20 minutes
---- CASSANDRA ----
// takes < 1 minute while also printing out all results
SELECT DISTINCT key1, key2, key3 FROM schema.table;
where the table format is like:
CREATE TABLE schema.table (
    key1 text,
    key2 text,
    key3 text,
    ckey1 text,
    ckey2 text,
    v1 int,
    PRIMARY KEY ((key1, key2, key3), ckey1, ckey2)
);
Doesn't Spark use Cassandra optimisations in its queries?
How can I retrieve this information efficiently?

Quick Answers
Doesn't Spark use Cassandra optimisations in its queries?
Yes, but with SparkSQL only column pruning and predicate pushdowns. With RDDs it is manual.
How can I retrieve this information efficiently?
Since your request returns quickly enough, I would just use the Java Driver directly to get this result set.
Long Answers
While Spark SQL can provide some C*-based optimizations, these are usually limited to predicate pushdowns when using the DataFrame interface. This is because the framework only provides limited information to the datasource. We can see this by doing an explain on the query you have written.
Let's start with the SparkSQL example:
scala> spark.sql("SELECT DISTINCT key1, key2, key3 FROM test.tab").explain
== Physical Plan ==
*HashAggregate(keys=[key1#30, key2#31, key3#32], functions=[])
+- Exchange hashpartitioning(key1#30, key2#31, key3#32, 200)
   +- *HashAggregate(keys=[key1#30, key2#31, key3#32], functions=[])
      +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation test.tab[key1#30,key2#31,key3#32] ReadSchema: struct<key1:string,key2:string,key3:string>
So your Spark example will actually be broken into several steps.
Scan: Read all the data from this table. This means serializing every value from the C* machine to the Spark executor JVM, in other words lots of work.
*HashAggregate/Exchange/HashAggregate: Take the values from each executor, hash them locally, then exchange the data between machines and hash again to ensure uniqueness. In layman's terms this means creating large hash structures, serializing them, running a complicated distributed sort-merge, then running a hash again. (Expensive)
Why doesn't any of this get pushed down to C*? This is because the datasource (the CassandraSourceRelation in this case) is not given the information about the Distinct part of the query. This is just part of how Spark currently works. Docs on what is pushable
So what about the RDD version?
With RDDs we give a direct set of instructions to Spark. This means that if you want to push something down, it must be manually specified. Let's see the debug output of the RDD request:
scala> sc.cassandraTable("test","tab").distinct.toDebugString
res2: String =
(13) MapPartitionsRDD[7] at distinct at <console>:45 []
| ShuffledRDD[6] at distinct at <console>:45 []
+-(13) MapPartitionsRDD[5] at distinct at <console>:45 []
| CassandraTableScanRDD[4] at RDD at CassandraRDD.scala:19 []
Here the issue is that your "distinct" call is a generic operation on an RDD and not specific to Cassandra. Since RDDs require all optimizations to be explicit (what you type is what you get), Cassandra never hears about this need for "distinct" and we get a plan that is almost identical to our Spark SQL version: do a full scan, serialize all of the data from Cassandra to Spark, do a shuffle, and then return the results.
So what can we do about this?
With SparkSQL this is about as good as we can get without adding new rules to Catalyst (the SparkSQL/Dataframes Optimizer) to let it know that Cassandra can handle some distinct calls at the server level. It would then need to be implemented for the CassandraRDD subclasses.
For RDDs we would need to add a function like the already existing where, select, and limit calls on the Cassandra RDD. A new distinct call could be added here, although it would only be allowable in specific situations. This is a function that currently does not exist in the SCC, but it could be added relatively easily since all it would do is prepend DISTINCT to requests and probably add some checking to make sure it is a DISTINCT that makes sense.
What can we do right now today without modifying the underlying connector?
Since we know the exact CQL request that we would like to make, we can always use the Cassandra driver directly to get this information. The Spark Cassandra Connector provides a driver pool we can use, or we could just use the Java driver natively. To use the pool we would do something like:
import com.datastax.spark.connector.cql.CassandraConnector
CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute("SELECT DISTINCT key1, key2, key3 FROM test.tab;").all()
}
And then parallelize the results if they are needed for further Spark work. If we really wanted to distribute this, it would most likely be necessary to add the function to the Spark Cassandra Connector as I described above.
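For example, a minimal, untested sketch of that last step (reusing the test.tab query above; the three-string tuple shape is just one choice):
import scala.collection.JavaConverters._
import com.datastax.spark.connector.cql.CassandraConnector
// Run the DISTINCT on the driver through the connector's session pool
val rows = CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute("SELECT DISTINCT key1, key2, key3 FROM test.tab").all().asScala
}
// Turn the (small) driver-side result set into an RDD for further Spark work
val keysRdd = sc.parallelize(rows.map(r =>
  (r.getString("key1"), r.getString("key2"), r.getString("key3"))))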

As long as we are selecting the partition key, we can use the .perPartitionLimit function of the CassandraRDD:
val partition_keys = sc.cassandraTable("schema","table").select("key1", "key2", "key3").perPartitionLimit(1)
This works because, per SPARKC-436
select key from some_table per partition limit 1
gives the same result as
select distinct key from some_table
This feature was introduced in spark-cassandra-connector 2.0.0-RC1 and requires at least C* 3.6.
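For example, assuming the partition_keys RDD defined above:
// One row comes back per Cassandra partition, so this count equals the number
// of distinct composite partition keys, with no Spark-side shuffle
partition_keys.count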

Distinct has bad performance.
Here is a good answer with some alternatives:
How to efficiently select distinct rows on an RDD based on a subset of its columns
You can make use of toDebugString to get an idea of how much data your code shuffles.
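For example, against the question's table (the lineage looks like the one shown earlier):
// A ShuffledRDD stage in the lineage means the selected columns are pulled out of
// Cassandra and shuffled before the distinct is applied
println(sc.cassandraTable("schema", "table").select("key1", "key2", "key3").distinct.toDebugString)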

Related

Date partition size 10GB read efficiently

We are using DataStax Cassandra 6.0 with Spark enabled. We have 10GB of data coming in every day. All queries are based on date. We have one huge table with 40 columns. We are planning to generate reports using Spark. What is the best way to set up this data, given that we keep getting data every day and keep around 1 year of data in one table?
We tried to use different partitions, but most of our keys are based on date.
No code, just need a suggestion.
Our queries should be fast enough. We have 9 nodes with 256GB RAM and a 44-core CPU.
Having the data organized in daily partitions isn't a very good design - in this case, only RF nodes will be active writing the data during the day, and then again at report generation time.
Because you'll be accessing that data only from Spark, you can use the following approach - have some bucket field as the partition key, filled with a uniformly generated random number, a timestamp as a clustering column, and maybe another uuid column to guarantee uniqueness of records, something like this:
create table test.sdtest (
    b int,
    ts timestamp,
    uid uuid,
    v1 int,
    primary key(b, ts, uid));
The maximum value used when generating b should be selected so that partitions are neither too big nor too small, so that we can read them effectively.
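As a rough, untested sketch (not part of the original answer), the write side could look like this, assuming an existing DataFrame named incoming with ts and v1 columns and an assumed bucket count of 32:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.cassandra._
// A uuid per record for uniqueness; the connector is expected to convert the string into a C* uuid (assumption)
val uuidUdf = udf(() => java.util.UUID.randomUUID().toString)
// Assign a random bucket in [0, 32); tune the bucket count so partitions are neither too big nor too small
val withBucket = incoming
  .withColumn("b", (rand() * 32).cast("int"))
  .withColumn("uid", uuidUdf())
withBucket.write
  .cassandraFormat("sdtest", "test")
  .mode("append")
  .save()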
And then we can run Spark code like this:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("sdtest", "test").load()
val filtered = data.filter("ts >= cast('2019-03-10T00:00:00+0000' as timestamp) AND ts < cast('2019-03-11T00:00:00+0000' as timestamp)")
The trick here is that we distribute data across the nodes by using the random partition key, so all nodes will handle the load both while writing the data and during report generation.
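You can print the plan below yourself by calling explain on the DataFrame:
// Shows which filters are pushed down to Cassandra for the query above
filtered.explain()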
If we look into physical plan for that Spark code (formatted for readability):
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [b#23,ts#24,v1#25]
PushedFilters: [*GreaterThanOrEqual(ts,2019-03-10 00:00:00.0),
*LessThan(ts,2019-03-11 00:00:00.0)], ReadSchema: struct<b:int,ts:timestamp,v1:int>
We can see that both conditions will be pushed to DSE at the CQL level - this means that Spark won't load all the data into memory and filter it; instead, all filtering will happen in Cassandra, and only the necessary data will be returned. And because we're spreading requests between multiple nodes, the reading could be faster (needs testing) than reading one giant partition. Another benefit of this design is that it is easy to delete old data using Spark, with something like this:
val toDel = sc.cassandraTable("test", "sdtest").where("ts < '2019-08-10T00:00:00+0000'")
toDel.deleteFromCassandra("test", "sdtest", keyColumns = SomeColumns("b", "ts"))
In this case, Spark will perform a very efficient range/row deletion that generates fewer tombstones.
P.S. It's recommended to use DSE's version of the Spark connector, as it may have more optimizations.
P.P.S. Theoretically, we could merge ts and uid into one timeuuid column, but I'm not sure that it will work with DataFrames.

How to make a Cassandra partition feel like a wide-row in Spark?

Cassandra exposes its partitions as multiple rows; however, internally they are stored as wide rows, and that is the way I would like to work with my data in Spark.
To be more specific, I will, one way or another, get an RDD of Cassandra partitions, or a DataFrame of these.
Then I would like to do a map operation, and in the closure, I would like to express something like this:
row['parameter1']['value'] / len(row['parameter2']['vector_value'])
Pseudocode just to give an idea: a simple division and taking the length of a vector.
My table would be
create table(
    dataset_name text,
    parameter text,
    value real,
    vector_value list<real>,
    primary key(dataset_name, parameter));
How can I do that efficiently, using PySpark?
I think I need something like Pandas set_index.
Logically, RDD groupBy seems to me to be what you want to do.
RDD groupBy is said to be bad for large groupings, but here we are grouping on a Cassandra partition, so it should stay within a single Spark partition, and it should be local since all rows of one Cassandra partition live on the same node.
I use Scala with Spark more than Python, so let's try. I have not tested it, though.
I would suggest:
rdd = sc.cassandraTable('keyspace', 'table') \
    .map(lambda x: (x.dataset_name, (x.parameter, x.value, x.vector_value)))  # key on the partition key to group on
rdd2 = rdd.groupByKey().mapValues(list)  # groupByKey returns (key, iterable); materialize each group as a list
Look at the groupBy / groupByKey functions:
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD
You will get one row per Cassandra partition and, inside each, a list of clustering rows, so you should be able to access 'parameter1' with [0] for the first occurrence, then 'parameter2' with [1].
EDIT: A colleague told me spark-cassandra-connector provides RDD methods that do exactly what you want, i.e. preserving clustering column grouping and ordering. They are called spanBy / spanByKey: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md#grouping-rows-by-partition-key
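A minimal, untested Scala sketch of spanBy against the question's table (keyspace/table names are placeholders):
import com.datastax.spark.connector._
// spanBy groups consecutive rows sharing the same partition key without a shuffle,
// relying on the connector returning clustering rows in order within each partition
val byPartition = sc.cassandraTable("keyspace", "table")
  .spanBy(row => row.getString("dataset_name"))
// byPartition: RDD[(String, Iterable[CassandraRow])], one entry per Cassandra partition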

SparkSQL spanning of Cassandra logical rows

I have a situation where I would like to "iterate" or map over "wide-rows" and not the logical Cassandra rows (CQL rows) using SparkSQL.
Basically my data is partitioned by timestamp (partition key) and there is a clustering key which is the sensor ID.
For each timestamp I would like to perform operations, a trivial example is to do sensor1/sensor2.
How could I do that efficiently with SparkSQL by keeping the data locality (and I think that my data model is rather well suited for these tasks)?
I read this post on Datastax which mentions spanBy and spanByKey in the Cassandra connector. How would this be used with SparkSQL?
Example pseudocode (PySpark):
ds = sqlContext.sql("SELECT * FROM measurements WHERE timestamp > xxx")
# span the ds by clustering key
# filter the ds " sensor4 > yyy "
# for each wide-row do sensor4 / sensor1
It's not possible right now. spanBy is only accessible from the programmatic API. Enabling it in SparkSQL would require extending the SparkSQL syntax to inject an extra clause, and that's a hard job...

Spark SQL and Cassandra JOIN

My Cassandra schema contains a table with a partition key which is a timestamp, and a parameter column which is a clustering key.
Each partition contains 10k+ rows. This is logging data at a rate of 1 partition per second.
On the other hand, users can define "datasets" and I have another table which contains, as a partition key the "dataset name" and a clustering column which is a timestamp referring to the other table (so a "dataset" is a list of partition keys).
Of course what I would like to do looks like an anti-pattern for Cassandra as I'd like to join two tables.
However using Spark SQL I can run such a query and perform the JOIN.
SELECT * from datasets JOIN data
WHERE data.timestamp = datasets.timestamp AND datasets.name = 'my_dataset'
Now the question is: is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?
Edit: fixed the answer with regard to join optimization
is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?
No. In fact, since you provide the partition key for the datasets table, the Spark/Cassandra connector will perform predicate push down and execute the partition restriction directly in Cassandra with CQL. But there will be no predicate push down for the join operation itself unless you use the RDD API with joinWithCassandraTable()
See here for all possible predicate push down situations: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/BasicCassandraPredicatePushDown.scala
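A hedged, untested RDD-level sketch of that join (the keyspace ks and the column names are assumptions based on the question's description):
import com.datastax.spark.connector._
// Pull the timestamps belonging to the dataset, restricted by its partition key
val datasetKeys = sc.cassandraTable("ks", "datasets")
  .where("name = ?", "my_dataset")
  .map(row => Tuple1(row.getDate("timestamp")))
// joinWithCassandraTable issues one CQL query per partition key instead of scanning the data table
val joined = datasetKeys.joinWithCassandraTable("ks", "data")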

How to parallelize RDD work when using the Cassandra Spark connector for data aggregation?

Here is the sample scenario: we have real-time data records in Cassandra, and we want to aggregate the data over different time ranges. I wrote code like this:
val timeRanges = getTimeRanges(report)
timeRanges.foreach { timeRange =>
  val (timestampStart, timestampEnd) = timeRange
  val query = _sc.get.cassandraTable(report.keyspace, utilities.Helper.makeStringValid(report.scope))
    .where("TIMESTAMP > ?", timestampStart)
    .where("VALID_TIMESTAMP <= ?", timestampEnd)
  // ...do the aggregation work...
}
The issue with this code is that the aggregation work for each time range does not run in parallel. My question is: how can I parallelize the aggregation work, given that an RDD can't be used inside another RDD or a Future? Is there any way to parallelize the work, or can we not use the Spark connector here?
Use the joinWithCassandraTable function. This allows you to use the data from one RDD to access C* and pull records just like in your example.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
joinWithCassandraTable utilizes the Java driver to execute a single query for every partition required by the source RDD, so no unneeded data will be requested or serialized. This means a join between any RDD and a Cassandra table can be performed without doing a full table scan. When performed between two Cassandra tables which share the same partition key, this will not require movement of data between machines. In all cases this method will use the source RDD's partitioning and placement for data locality.
Finally, we use union to combine the per-range RDDs so they run in parallel.
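A hedged, untested sketch of that union approach, reusing the names from the question's snippet:
// Build one CassandraRDD per time range, tagging rows with their range so the
// aggregation can still be done per range; nothing runs yet because RDDs are lazy
val perRangeRdds = timeRanges.map { case (timestampStart, timestampEnd) =>
  _sc.get.cassandraTable(report.keyspace, utilities.Helper.makeStringValid(report.scope))
    .where("TIMESTAMP > ?", timestampStart)
    .where("VALID_TIMESTAMP <= ?", timestampEnd)
    .map(row => ((timestampStart, timestampEnd), row))
}
// Union into a single RDD so Spark schedules all per-range scans in parallel,
// then aggregate by the range key
val allRanges = _sc.get.union(perRangeRdds.toSeq)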
