SparkSQL spanning of Cassandra logical rows - apache-spark

I have a situation where I would like to "iterate" or map over "wide-rows" and not the logical Cassandra rows (CQL rows) using SparkSQL.
Basically my data is partitioned by timestamp (partition key) and there is a clustering key which is the sensor ID.
For each timestamp I would like to perform operations, a trivial example is to do sensor1/sensor2.
How could I do that efficiently with SparkSQL by keeping the data locality (and I think that my data model is rather well suited for these tasks)?
I read this post on Datastax which mentions spanBy and spanByKey in the Cassandra connector. How would this be used with SparkSQL?
Example of pseudo-code (pySpark):
ds = sqlContext.sql("SELECT * FROM measurements WHERE timestamp > xxx")
# span the ds by clustering key
# filter the ds " sensor4 > yyy "
# for each wide-row do sensor4 / sensor1

It's not possible right now. The spanBy API is only accessible from the programmatic (RDD) API. To enable it in SparkSQL, the SparkSQL syntax would have to be extended to inject an extra clause, and that's a hard job...
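For reference, here is roughly what the programmatic API version looks like in Scala with the Spark Cassandra Connector; the keyspace/table and column names ("test", "measurements", "timestamp", "sensor", "value") are placeholders based on the question, not a verified schema:
import com.datastax.spark.connector._

val spans = sc.cassandraTable("test", "measurements")
  .spanBy(row => row.getDate("timestamp"))   // one group per Cassandra partition, i.e. per wide row
// spans: RDD[(java.util.Date, Iterable[CassandraRow])], rows arrive in clustering-key order
val ratios = spans.mapValues { rows =>
  // index the wide row by sensor id, then compute e.g. sensor4 / sensor1
  val bySensor = rows.map(r => r.getInt("sensor") -> r.getDouble("value")).toMap
  bySensor(4) / bySensor(1)
}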

Related

Date partition size 10GB read efficiently

We are using DataStax Cassandra 6.0 with Spark enabled. We have 10GB of data coming in every day, and all queries are based on date. We have one huge table with 40 columns, and we are planning to generate reports using Spark. What is the best way to set up this data, given that we keep getting data every day and keep about one year of it in one table?
We tried different partitioning schemes, but most of our keys are based on date.
No code, just need a suggestion.
Our queries should be fast enough. We have 9 nodes with 256GB of RAM and 44-core CPUs.
Having the data organized in daily partitions isn't a very good design - in this case, only RF (replication factor) nodes will be active during the day writing the data, and then again at report-generation time.
Because you'll be accessing that data only from Spark, you can use the following approach: use some bucket field as the partition key, for example a uniformly generated random number, with the timestamp as a clustering column, and maybe another uuid column to guarantee uniqueness of records, something like this:
create table test.sdtest (
b int,
ts timestamp,
uid uuid,
v1 int,
primary key(b, ts, uid));
The maximum value used when generating b should be chosen so that partitions end up neither too big nor too small, so we can read them effectively.
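For illustration, a write-side sketch that fills in the bucket; the bucket range of 32 and the input DataFrame df (with columns ts, uid, v1) are assumptions, not part of the original answer:
import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.cassandra._

// df: an existing DataFrame with columns ts, uid, v1 (assumed)
val withBucket = df.withColumn("b", (rand() * 32).cast("int"))   // uniform bucket in [0, 32)
withBucket.write
  .cassandraFormat("sdtest", "test")
  .mode("append")
  .save()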
And then we can run Spark code like this:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("sdtest", "test").load()
val filtered = data.filter("ts >= cast('2019-03-10T00:00:00+0000' as timestamp) AND ts < cast('2019-03-11T00:00:00+0000' as timestamp)")
The trick here is that we distribute data across the nodes by using the random partition key, so all nodes handle the load both while writing the data and while generating the report.
If we look into physical plan for that Spark code (formatted for readability):
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [b#23,ts#24,v1#25]
PushedFilters: [*GreaterThanOrEqual(ts,2019-03-10 00:00:00.0),
*LessThan(ts,2019-03-11 00:00:00.0)], ReadSchema: struct<b:int,ts:timestamp,v1:int>
We can see that both conditions will be pushed to DSE at the CQL level - this means that Spark won't load all the data into memory and filter it; instead, all filtering will happen in Cassandra and only the necessary data will be returned. And because we're spreading requests across multiple nodes, reading could be faster (needs testing) than reading one giant partition. Another benefit of this design is that it is easy to delete old data using Spark, with something like this:
import com.datastax.spark.connector._

val toDel = sc.cassandraTable("test", "sdtest").where("ts < '2019-08-10T00:00:00+0000'")
toDel.deleteFromCassandra("test", "sdtest", keyColumns = SomeColumns("b", "ts"))
In this case, Spark will perform a very efficient range/row deletion that generates fewer tombstones.
P.S. it's recommended to use DSE's version of the Spark connector as it may have more optimizations.
P.P.S. theoretically, we can merge ts and uid into one timeuuid column, but I'm not sure that it will work with Dataframes.

SELECT DISTINCT Cassandra in Spark

I need a query that lists the unique composite partition keys inside of Spark.
The query in Cassandra, SELECT DISTINCT key1, key2, key3 FROM schema.table;, is quite fast; however, putting the same sort of filter into an RDD or spark.sql retrieves results incredibly slowly in comparison.
e.g.
---- SPARK ----
var t1 = sc.cassandraTable("schema","table").select("key1", "key2", "key3").distinct()
var t2 = spark.sql("SELECT DISTINCT key1, key2, key3 FROM schema.table")
t1.count // takes 20 minutes
t2.count // takes 20 minutes
---- CASSANDRA ----
// takes < 1 minute while also printing out all results
SELECT DISTINCT key1, key2, key3 FROM schema.table;
where the table format is like:
CREATE TABLE schema.table (
key1 text,
key2 text,
key3 text,
ckey1 text,
ckey2 text,
v1 int,
PRIMARY KEY ((key1, key2, key3), ckey1, ckey2)
);
Doesn't Spark use Cassandra optimisations in its queries?
How can I retrieve this information efficiently?
Quick Answers
Doesn't Spark use Cassandra optimisations in its queries?
Yes, but with SparkSQL only column pruning and predicate pushdown. With RDDs it is manual.
How can I retrieve this information efficiently?
Since your request returns quickly enough, I would just use the Java driver directly to get this result set.
Long Answers
While Spark SQL can provide some C*-based optimizations, these are usually limited to predicate pushdowns when using the DataFrame interface. This is because the framework only provides limited information to the datasource. We can see this by doing an explain on the query you have written.
Let's start with the SparkSQL example:
scala> spark.sql("SELECT DISTINCT key1, key2, key3 FROM test.tab").explain
== Physical Plan ==
*HashAggregate(keys=[key1#30, key2#31, key3#32], functions=[])
+- Exchange hashpartitioning(key1#30, key2#31, key3#32, 200)
+- *HashAggregate(keys=[key1#30, key2#31, key3#32], functions=[])
+- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation test.tab[key1#30,key2#31,key3#32] ReadSchema: struct<key1:string,key2:string,key3:string>
So your Spark example will actually be broken into several steps.
Scan: read all the data from this table. This means serializing every value from the C* machine to the Spark executor JVM, in other words lots of work.
*HashAggregate/Exchange/HashAggregate: take the values from each executor, hash them locally, then exchange the data between machines and hash again to ensure uniqueness. In layman's terms this means creating large hash structures, serializing them, running a complicated distributed sort-merge, then running a hash again. (Expensive)
Why doesn't any of this get pushed down to C*? This is because the datasource (the CassandraSourceRelation in this case) is not given the information about the DISTINCT part of the query. This is just part of how Spark currently works. Docs on what is pushable
So what about the RDD version?
With RDDs we give a direct set of instructions to Spark. This means that if you want to push something down, it must be manually specified. Let's see the debug output of the RDD request:
scala> sc.cassandraTable("test","tab").distinct.toDebugString
res2: String =
(13) MapPartitionsRDD[7] at distinct at <console>:45 []
| ShuffledRDD[6] at distinct at <console>:45 []
+-(13) MapPartitionsRDD[5] at distinct at <console>:45 []
| CassandraTableScanRDD[4] at RDD at CassandraRDD.scala:19 []
Here the issue is that your distinct call is a generic operation on an RDD and not specific to Cassandra. Since RDDs require all optimizations to be explicit (what you type is what you get), Cassandra never hears about this need for DISTINCT, and we get a plan that is almost identical to our Spark SQL version: do a full scan, serialize all of the data from Cassandra to Spark, do a shuffle, and then return the results.
So what can we do about this?
With SparkSQL this is about as good as we can get without adding new rules to Catalyst (the SparkSQL/Dataframes Optimizer) to let it know that Cassandra can handle some distinct calls at the server level. It would then need to be implemented for the CassandraRDD subclasses.
For RDDs we would need to add a function like the already existing where, select, and limit calls on the Cassandra RDD. A new distinct call could be added here, although it would only be allowable in specific situations. This is a function that currently does not exist in the SCC, but it could be added relatively easily, since all it would do is prepend DISTINCT to requests and probably add some checking to make sure the DISTINCT makes sense.
What can we do right now today without modifying the underlying connector?
Since we know the exact CQL request that we would like to make we can always use the Cassandra driver directly to get this information. The Spark Cassandra connector provides a driver pool we can use or we could just use the Java Driver natively. To use the pool we would do something like
import com.datastax.spark.connector.cql.CassandraConnector
CassandraConnector(sc.getConf).withSessionDo{ session =>
session.execute("SELECT DISTINCT key1, key2, key3 FROM test.tab;").all()
}
And then parallelize the results if they are needed for further Spark work. If we really wanted to distribute this, it would most likely be necessary to add the function to the Spark Cassandra Connector as I described above.
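For example, a rough sketch of that approach (using the same test.tab names as above):
import scala.collection.JavaConverters._
import com.datastax.spark.connector.cql.CassandraConnector

// run the DISTINCT query through the connector's session pool on the driver
val distinctKeys = CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute("SELECT DISTINCT key1, key2, key3 FROM test.tab").all().asScala
    .map(r => (r.getString("key1"), r.getString("key2"), r.getString("key3")))
    .toList
}
// the result set is small, so parallelizing it for further Spark work is cheap
val keysRdd = sc.parallelize(distinctKeys)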
As long as we are selecting the partition key, we can use the .perPartitionLimit function of the CassandraRDD:
val partition_keys = sc.cassandraTable("schema","table").select("key1", "key2", "key3").perPartitionLimit(1)
This works because, per SPARKC-436
select key from some_table per partition limit 1
gives the same result as
select distinct key from some_table
This feature was introduced in spark-cassandra-connector 2.0.0-RC1 and requires at least C* 3.6.
Distinct has bad performance.
Here is a good answer with some alternatives:
How to efficiently select distinct rows on an RDD based on a subset of its columns
You can use toDebugString to get an idea of how much data your code shuffles.

Spark-Cassandra: how to efficiently restrict partitions

After days of thinking about it, I'm still stuck with this problem: I have one table where "timestamp" is the partition key. This table contains billions of rows.
I also have "timeseries" tables that contain timestamps related to specific measurement processes.
With Spark I want to analyze the content of the big table. Of course it is not efficient to do a full table scan, and with a rather fast lookup in the timeseries table I should be able to target only, say, 10k partitions.
What is the most efficient way to achieve this?
Is SparkSQL smart enough to optimize something like this
sqlContext.sql("""
SELECT timeseries.timestamp, bigtable.value1 FROM timeseries
JOIN bigtable ON bigtable.timestamp = timeseries.timestamp
WHERE timeseries.parameter = 'xyz'
""")
Ideally I would expect Cassandra to fetch the timestamps from the timeseries table and then use that to query only that subset of partitions from bigtable.
If you add an "explain" call to your query you'll see what the Catalyst planner will do for it, but I know it will not do the optimizations you want.
Currently Catalyst has no support for pushing down joins to DataSources, which means the structure of your query is most likely going to look like:
Read Data From Table timeseries with predicate parameter = 'xyz'
Read Data From Table bigtable
Join these two results
Filter on bigtable.timestamp == timeseries.timestamp
The Spark Cassandra Connector will be given the predicate from the timeseries table read and will be able to optimize it if it is a clustering key or a partition key. See the Spark Cassandra Connector docs. If it doesn't fit into one of those pushdown categories, it will require a full table scan followed by a filter in Spark.
Since the "Read Data From Table bigtable" step has no restrictions on it, Spark will instruct the connector to read the entire table (full table scan).
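To make that concrete, here is a sketch of the two reads described above with the Spark 2 DataFrame API (the keyspace ks is a placeholder; column names follow the question):
import org.apache.spark.sql.cassandra._

val timeseries = spark.read.cassandraFormat("timeseries", "ks").load()
  .filter("parameter = 'xyz'")   // pushed down only if parameter is a partition or clustering key
val bigtable = spark.read.cassandraFormat("bigtable", "ks").load()   // no restriction: full table scan
val joined = bigtable.join(timeseries, Seq("timestamp"))
  .select("timestamp", "value1")
// joined.explain() shows the join happening in Spark, not pushed down to Cassandra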
I can only take a guess on the optimizations done by the driver, but I'd surely expect a query such as that to restrict the JOIN on the WHERE, which means that your simple query will be optimized.
What I will do as well is point you in the general direction of optimizing Spark SQL. Have a look at Catalyst for Spark SQL, which is a tool for greatly optimizing queries all the way down to the physical level.
Here is a breakdown of how it works:
Deep Dive into Spark SQL Catalyst Optimizer
And the link to the git-repo: Catalyst repo

Spark SQL and Cassandra JOIN

My Cassandra schema contains a table with a partition key which is a timestamp, and a parameter column which is a clustering key.
Each partition contains 10k+ rows. This is logging data at a rate of 1 partition per second.
On the other hand, users can define "datasets", and I have another table which contains the "dataset name" as partition key and a timestamp (referring to the other table) as clustering column - so a "dataset" is a list of partition keys.
Of course what I would like to do looks like an anti-pattern for Cassandra as I'd like to join two tables.
However using Spark SQL I can run such a query and perform the JOIN.
SELECT * from datasets JOIN data
WHERE data.timestamp = datasets.timestamp AND datasets.name = 'my_dataset'
Now the question is: is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?
Edit: fix the answer with regard to join optimization
is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?
No. In fact, since you provide the partition key for the datasets table, the Spark/Cassandra connector will perform predicate push down and execute the partition restriction directly in Cassandra with CQL. But there will be no predicate push down for the join operation itself unless you use the RDD API with joinWithCassandraTable()
See here for all possible predicate push down situations: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/BasicCassandraPredicatePushDown.scala
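For completeness, a rough sketch of the RDD-side alternative (the keyspace ks is a placeholder, and the timestamp column is assumed to map to java.util.Date):
import com.datastax.spark.connector._

// read only the 'my_dataset' partition from datasets and keep its timestamps
val timestamps = sc.cassandraTable("ks", "datasets")
  .where("name = ?", "my_dataset")
  .map(row => Tuple1(row.getDate("timestamp")))

// one targeted CQL query per partition key of data -- no full table scan
val joined = timestamps.joinWithCassandraTable("ks", "data").on(SomeColumns("timestamp"))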

How to parallelize RDD work when using the Cassandra Spark connector for data aggregation?

Here is the sample scenario: we have real-time data records in Cassandra, and we want to aggregate the data over different time ranges. I wrote code like below:
val timeRanges = getTimeRanges(report)
timeRanges.foreach { timeRange =>
  val (timestampStart, timestampEnd) = timeRange
  val query = _sc.get.cassandraTable(report.keyspace, utilities.Helper.makeStringValid(report.scope))
    .where("TIMESTAMP > ?", timestampStart)
    .where("VALID_TIMESTAMP <= ?", timestampEnd)
  // ...do the aggregation work...
}
The issue with this code is that the aggregation work for the individual time ranges does not run in parallel. My question is: how can I parallelize the aggregation work, given that an RDD can't be used inside another RDD or a Future? Is there any way to parallelize the work, or can we not use the Spark connector here?
Use the joinWithCassandraTable function. This allows you to use the data from one RDD to access C* and pull records just like in your example.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
joinWithCassandraTable utilizes the Java driver to execute a single query for every partition required by the source RDD, so no unneeded data will be requested or serialized. This means a join between any RDD and a Cassandra table can be performed without doing a full table scan. When performed between two Cassandra tables which share the same partition key, this will not require movement of data between machines. In all cases this method will use the source RDD's partitioning and placement for data locality.
Finally, we used union to combine the per-range RDDs so that they run in parallel.
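A minimal sketch of that union approach, assuming timeRanges is a Seq of (start, end) pairs and keyspace/table stand in for the values used above:
import com.datastax.spark.connector._

// build one lazily evaluated RDD per time range, then union them so Spark
// schedules all scans and the aggregation as a single parallel job
val perRange = timeRanges.map { case (timestampStart, timestampEnd) =>
  sc.cassandraTable(keyspace, table)
    .where("TIMESTAMP > ?", timestampStart)
    .where("VALID_TIMESTAMP <= ?", timestampEnd)
}
val allRanges = sc.union(perRange)
// ...run the aggregation once over allRanges instead of once per range...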
