PySpark Cassandra Connector efficiently querying across partition keys - apache-spark

I'm faced with the following problem using PySpark and dataframes with the cassandra-connector. My Cassandra data lake consists of metric measurements across (network) devices, and the entries are of type (device,interface,metric,time,value).
My cassandra table for the raw data has:
PRIMARY KEY ((device,interface,metric),time)
for supposedly efficient fetching of time ranges for a given measurement.
Now for reporting purposes, users can query any set of device/interface/metric combinations (ie give me a specific metric for all interfaces of a device). Now I know the list of each, so I'm not looking to do wildcard searches, but rather IN queries.
I'm using Spark 1.4, so I'm adding filters like to obtain dataframes to calculate min/max/percentile/etc... on the recorded metric values.
metrics_raw_sub = metrics_raw\
.filter(metrics_raw.device.inSet (device_list))\
.filter(metrics_raw.interface.inSet (interface_list))\
.filter(metrics_raw.metric.inSet (metric_list))
This isn't very efficient as these predicates do not get pushed down to CQL (only the last predicate can be an IN query), so I'm pulling in tons of data and filtering on the client side. (not good)
Why doesn't cassandra-connector allow multiple IN predicates across partition columns? Doing this in a native CQL shell appears to work?
Another approach to my problem above would be to (and this yields efficient individual queries as predicates are pushed down to Cassandra):
for device in device_list:
for interface in interface_list:
metrics_raw_sub = metrics_raw\
.filter(metrics_raw.device == device)\
.filter(metrics_raw.interface == interface)\
.filter(metrics_raw.metric.inSet (metric_list))
And then run the aggregation logic for each subquery, but I feel like this is largely serialising what should be a parallel computation across all requested device/interface/metric values... Can I batch the Cassandra queries so I can run my analytics on one large distributed dataframe?
Bottom line, I'm looking to do this very efficiently. If the turn-around times are short enough, we'll run these on-demand. If not, we'll need to look into pre-computing them and storing into tables (which sacrifices flexibility for doing custom time-range reporting)
Any insights would be much appreciated!!
Nik.

Related

What is the performance difference between stream.filter instead of CQL ALLOW FILTERING?

The data in my Cassandra DB table doesn't have much data right now.
However, since it is a table where data is continuously accumulated, I am interested in performance issues.
First of all, please don't think about the part where you need to redesign the table.
Think of it as a general RDBS date-based lookup. (startDate ~ endDate)
From Cassandra DB
Apply allow filtering and force the query.
This will get you exactly the data you want.
Query "all data" in Cassandra DB, This query only needs to be done once. (no where)
After that, only the data within the desired date is extracted through the stream().filter() function.
Which method would you choose?
In general, which one has more performance issues?
Summary: You need to do about 6 methods.
Execute allow filtering query 6 times / Not perform stream filter
Execute findAll query once / Execute stream filter 6 times
The challenge with both options is that neither will scale. It may work with very small data sets, say less than 1000 partitions, but you will quickly find that neither will work once your tables grow.
Cassandra is designed for real-time OLTP workloads where you are retrieving a single partition for real-time applications.
For analytics workloads, you should instead use Spark with the spark-cassandra-connector because it optimises analytics queries. Cheers!

Can hbase-spark connector be used for sorting hbase rows by some column with good performance?

Well the the title of the questions says it all. I have a requirement which requires getting row keys corresponding to top X (say top 10) values in certain column. Thus, I need to sort hbase rows by the desired column values. I don't understand how should I do this or even is doable or not. It seems that hbase does not cater to this very well. Also it does not allow any such functionality out of the box.
Q1. Can I use hbase-spark connector, load whole hbase data in spark rdd and then perform sorting in it? Will this be fast? How the connector and spark will handle it? Will it fetch whole data on single node or multiple nodes and sort in distributed manner?
Q2. Also is there any better way to do this?
Q3. Is it undoable in hbase at all? and should I opt for different framework/technology altogether?
A3. If you need to sort your data by some column (not row-key), you get no benefit from using HBase. It'll be the same as reading raw files from hive/hdfs and sort, but slower.
A1. Sure you can use SHC or any other spark-hbase library for that matter, but A3 still holds. It will load the entire data on every region server as Spark RDD, only to shuffle it across your entire cluster.
A2. As any other programming/architecture issue, there are many possible solutions depending on your resources and requirements.
Will spark load all the data on single node and do sorting on single node or will it perform sorting on different nodes?
It depends on two factors:
How many regions your table has: This determines the parallelism degree (number of partitions) for reading from your table.
spark.sql.shuffle.partitions configuration value: After loading the data from the table, this value determines the parallelism degree for the sorting stage.
is there any better [library] than the SHC?
As for today there are multiple libraries for integrating Spark with HBase, each has its own pros and cons, and TMO none of them is fully mature or gives full coverage (compared Spark-Hive integration, for example). To get the best from Spark over HBase you should have a very good understanding of your use case and select the most suitable library.
Q2. Also is there any better way to do this?
If re-designing your HBase table is an option with this specific column value as part of the rowkey, this would allow fast access to these values as HBase is optimised for rowkey filters and not column filters.
You could then create a rowkey concatenation of the existing_rowkey + this_col_value. Querying it then with a Row Filter would have better performance results.

What is your approach for querying Cassandra with Spark (in R or Python)?

I am working with about a TB of data stored in Cassandra and trying to query it using Spark and R (could be Python).
My preference for querying the data would be to abstract the Cassandra table I'm querying from as a Spark RDD (using sparklyr and the spark-cassandra-connector with spark-sql) and simply doing an inner join on the column of interest (it is a partition key column). The company I'm working with says that this approach is a bad idea as it will translate into an IN clause in CQL and thus cause a big slow-down.
Instead I'm using their preferred method: write a closure that will extract the data for a single id in the partition key using a jdbc connection and then apply that closure 200k times for each id I'm interested in. I use spark_apply to apply that closure in parallel for each executor. I also set my spark.executor.cores to 1 so I get a lot of parellelization.
I'm having a lot of trouble with this approach and am wondering what the best practice is. Is it true that Spark SQL does not account for the slowdown associated with pulling multiple ids from a partition key column (IN operator)?
A few points here:
Working with Spark-SQL is not always the most performant option, the
optimized might not always as good of a job than a job you write
yourself
Check the logs carefully during your work, always check how your high-level queries are translated to CQL queries. In particular, make sure you avoid a full table scan if you can.
If you joining on the partition key, you should look into leveraging the methods: repartitionByCassandraReblica, and joinWithCassandraTable. Have a look at the official doc here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md and Tip4 of this blog post: https://www.instaclustr.com/cassandra-connector-for-spark-5-tips-for-success/
Finale note, it's quite common to have 2 Cassandra data center when using Spark. The first one serves regular read / write, the second one is used for running Spark. It's a separation of concern best practice (at the cost of an additional DC of course).
Hope it helps!

Is it bad to use INDEX in Cassandra if performance is not important?

Background
We have recently started a "Big Data" project where we want to track what users are doing with our product - how often they are logging in, which features they are clicking on, etc - your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of it will be "how often did X occur over the last Y months?" type of thing, so we started storing the data sooner rather than later thinking we can always migrate, re-shape etc when we need to but if we don't store it it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure, distribution and how you access it; you were right before when you compared this process to RDMS. For Cassandra, it's best to define your queries first and then build the data model.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query where it is known exactly which node needs to be quired). So there's not just an impact on writes, but on read performance as well.
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that Datastax have built, to quickly generate profile yamls:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.

Perform queries over the time-series stream of data

I'm trying to design an architecture of my streaming application and choose the right tools for the job.
This is how it works currently:
Messages from "application-producer" part have a form of (address_of_sensor, timestamp, content) tuples.
I've already implemented all functionality before Kafka, and now I've encountered major flaw in the design. In "Spark Streaming" part, consolidated stream of messages is translated into stream of events. The problem is that events for the most part are composite - consist of multiple messages, which have occurred at the same time at different sensors.
I can't rely on "time of arrival to Kafka" as a mean to detect "simultaneity". So I has to somehow sort messages in Kafka before extracting them with Spark. Or, more precisely, make queries over Kafka messages.
Maybe Cassandra is the right replacement for Kafka here? I have really simple data model, and only two possible types of queries to perform: query by address, and range query by timestamp. Maybe this is the right choice?
Do somebody have any numbers of Cassandra's throughput?
If you want to run queries on your time series, Cassandra may be the best fit - it is very write optimized, you can build 'wide' rows for your series. It is possible to make slices on your wide rows, so you can select some time ranges with only one query.
On the other hand, kafka can be considered as a raw data flow - you don't have queries, only recently produced data. In order to collect data based on some key in the same partition, you have to select this key carefully. All data within same partition are time sorted.
Range Query on Timestamp is the classic use case of cassandra , if u need address based queries as well u would have to make them as clustering column if using cassandra . As far as cassandra througput are concerned if you can invest in proper performance analysis on cassandra cluster you can achieve very high write throughput . But I have used SparkQL , Cassandra Driver and spark Cassandra connector they don't really give high query throughput speed until you have a big cluster with high CPU configuration , it does not work well with small dataset .
Kafka should not be used as data source for queries , its more of commit log

Resources