Performing queries over a time-series stream of data - Cassandra

I'm trying to design an architecture of my streaming application and choose the right tools for the job.
This is how it works currently:
Messages from the "application-producer" part have the form of (address_of_sensor, timestamp, content) tuples.
I've already implemented all the functionality before Kafka, and now I've encountered a major flaw in the design. In the "Spark Streaming" part, the consolidated stream of messages is translated into a stream of events. The problem is that events are mostly composite: they consist of multiple messages that occurred at the same time at different sensors.
I can't rely on "time of arrival to Kafka" as a means to detect simultaneity. So I have to somehow sort messages in Kafka before extracting them with Spark - or, more precisely, make queries over Kafka messages.
Maybe Cassandra is the right replacement for Kafka here? I have a really simple data model, and only two types of queries to perform: query by address, and range query by timestamp. Maybe this is the right choice?
Does anybody have any numbers on Cassandra's throughput?

If you want to run queries on your time series, Cassandra may be the best fit - it is very write-optimized, and you can build 'wide' rows for your series. It is possible to take slices of your wide rows, so you can select a time range with only one query.
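The slice idea can be sketched in plain Python: a bisect over a time-sorted list stands in for Cassandra's clustering order within a partition (names and data are illustrative):

```python
import bisect

# One "wide row": all readings for a sensor, kept sorted by timestamp,
# just as Cassandra keeps rows within a partition sorted by clustering key.
partition = [(1000, "a"), (1005, "b"), (1010, "c"), (1020, "d")]

def slice_range(rows, t_start, t_end):
    """Return readings with t_start <= timestamp < t_end in a single slice."""
    timestamps = [t for t, _ in rows]
    lo = bisect.bisect_left(timestamps, t_start)
    hi = bisect.bisect_left(timestamps, t_end)
    return rows[lo:hi]

print(slice_range(partition, 1005, 1020))  # [(1005, 'b'), (1010, 'c')]
```

Because the data is already sorted on disk, the range comes back with one contiguous read rather than a scan.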
On the other hand, Kafka can be considered a raw data flow - you don't have queries, only recently produced data. In order to collect data with the same key in the same partition, you have to select that key carefully. All data within the same partition is time-sorted.
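A minimal sketch of that key-based partitioning, in plain Python (Kafka's default partitioner actually hashes the key with murmur2; a simple byte sum stands in for it here):

```python
# Route each message to a partition by hashing its key, as Kafka's
# default partitioner does conceptually (stand-in hash, not murmur2).
NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    return sum(key.encode()) % NUM_PARTITIONS  # deterministic stand-in hash

messages = [("sensor-1", 1), ("sensor-2", 1), ("sensor-1", 2), ("sensor-1", 3)]
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for key, ts in messages:
    partitions[partition_for(key)].append((key, ts))

# All sensor-1 messages land in the same partition, in production order.
p = partition_for("sensor-1")
print([m for m in partitions[p] if m[0] == "sensor-1"])
```

This is why keying by sensor address gives per-sensor ordering, but it still gives no ordering guarantee across sensors, which is the questioner's actual problem.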

Range query on timestamp is the classic use case for Cassandra. If you need address-based queries as well, you would have to make address a clustering column. As far as Cassandra throughput is concerned: if you can invest in proper performance analysis of your Cassandra cluster, you can achieve very high write throughput. However, I have used SparkSQL, the Cassandra driver and the Spark Cassandra connector, and they don't give high query throughput until you have a big cluster with a high-CPU configuration; they don't work well with small datasets.
Kafka should not be used as a data source for queries; it's more of a commit log.

Related

What is the performance difference between stream.filter and CQL ALLOW FILTERING?

My Cassandra table doesn't hold much data right now.
However, since data is continuously accumulated in it, I am interested in performance issues.
First of all, please set aside the suggestion that the table be redesigned; think of it as a general RDBMS date-based lookup (startDate ~ endDate). The two options from Cassandra are:
1. Apply ALLOW FILTERING and force the query. This gets exactly the data you want.
2. Query all data in the table (no WHERE clause); this query only needs to be done once. After that, only the data within the desired date range is extracted through the stream().filter() function.
Which method would you choose? In general, which one has more performance issues?
Summary: about 6 such lookups are needed, so the choice is between:
- Executing the ALLOW FILTERING query 6 times, with no stream filter, or
- Executing a findAll query once and the stream filter 6 times.
The challenge with both options is that neither will scale. It may work with very small data sets, say less than 1000 partitions, but you will quickly find that neither will work once your tables grow.
Cassandra is designed for real-time OLTP workloads where you are retrieving a single partition for real-time applications.
For analytics workloads, you should instead use Spark with the spark-cassandra-connector because it optimises analytics queries. Cheers!
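The trade-off can be simulated in plain Python (a list of dicts stands in for the table, integer dates are illustrative). Both approaches end up touching every row; the difference is only where the filtering happens, which is why neither scales:

```python
rows = [{"id": i, "date": 20230100 + i} for i in range(1, 31)]  # one month of rows

def allow_filtering_query(table, start, end):
    # Server-side scan: with ALLOW FILTERING Cassandra still reads every
    # row of the table, which is why this does not scale.
    return [r for r in table if start <= r["date"] <= end]

def find_all_then_stream_filter(table, start, end):
    # Client-side: fetch everything once, then filter in memory.
    fetched = list(table)  # the expensive full-table read
    return [r for r in fetched if start <= r["date"] <= end]

a = allow_filtering_query(rows, 20230105, 20230110)
b = find_all_then_stream_filter(rows, 20230105, 20230110)
assert a == b  # same result set either way
print(len(a))  # 6
```

Either way the whole table is scanned once per strategy; only a partition-key or clustering-key based data model avoids that.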

Can hbase-spark connector be used for sorting hbase rows by some column with good performance?

Well, the title of the question says it all. I have a requirement of getting the row keys corresponding to the top X (say top 10) values in a certain column. Thus, I need to sort HBase rows by the desired column's values. I don't understand how I should do this, or even whether it is doable at all. It seems that HBase does not cater to this very well, and it does not offer any such functionality out of the box.
Q1. Can I use the hbase-spark connector, load the whole HBase dataset into a Spark RDD and then sort it there? Will this be fast? How will the connector and Spark handle it? Will it fetch the whole data onto a single node, or onto multiple nodes and sort in a distributed manner?
Q2. Also is there any better way to do this?
Q3. Is this simply not doable in HBase? Should I opt for a different framework/technology altogether?
A3. If you need to sort your data by some column (not the row key), you get no benefit from using HBase. It'll be the same as reading raw files from Hive/HDFS and sorting them, but slower.
A1. Sure, you can use SHC or any other Spark-HBase library for that matter, but A3 still holds: it will load the entire data from every region server as a Spark RDD, only to shuffle it across your entire cluster.
A2. As any other programming/architecture issue, there are many possible solutions depending on your resources and requirements.
Will Spark load all the data onto a single node and sort on that single node, or will it perform the sorting on different nodes?
It depends on two factors:
How many regions your table has: This determines the parallelism degree (number of partitions) for reading from your table.
spark.sql.shuffle.partitions configuration value: After loading the data from the table, this value determines the parallelism degree for the sorting stage.
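The pattern behind a distributed top-X can be sketched in plain Python: each partition computes a local top-X (on Spark this runs in parallel across the shuffle partitions), and only those small intermediate results are merged. heapq stands in for the per-partition work:

```python
import heapq

def top_x_distributed(partitions, x, key):
    # Step 1: each partition computes its local top-x (parallel on Spark).
    local_tops = [heapq.nlargest(x, part, key=key) for part in partitions]
    # Step 2: merge the small local results on the driver.
    merged = [row for local in local_tops for row in local]
    return heapq.nlargest(x, merged, key=key)

# Rows are (row_key, column_value); three lists stand in for three regions.
parts = [[("r1", 5), ("r2", 9)], [("r3", 7), ("r4", 1)], [("r5", 8)]]
print(top_x_distributed(parts, 3, key=lambda r: r[1]))
```

Only x rows per partition ever cross the network, which is why the number of regions and `spark.sql.shuffle.partitions` both shape the runtime.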
is there any better [library] than the SHC?
As of today there are multiple libraries for integrating Spark with HBase; each has its own pros and cons, and IMO none of them is fully mature or gives full coverage (compared to the Spark-Hive integration, for example). To get the best from Spark over HBase you should have a very good understanding of your use case and select the most suitable library.
Q2. Also is there any better way to do this?
If re-designing your HBase table is an option, making this specific column value part of the rowkey would allow fast access to these values, as HBase is optimised for rowkey filters rather than column filters.
You could then create a rowkey that concatenates existing_rowkey + this_col_value. Querying it with a Row Filter would then give better performance.
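A plain-Python sketch of the composite-rowkey idea: a sorted list plus bisect stands in for HBase's lexicographic rowkey order and a prefix Row Filter (the key layout and padding are illustrative):

```python
import bisect

# Composite rowkeys: existing_rowkey + '#' + zero-padded column value, so
# rows sort lexicographically by key and then by value within a key.
keys = sorted(f"{rk}#{val:05d}" for rk, val in
              [("dev1", 42), ("dev1", 7), ("dev2", 99), ("dev1", 300)])

def prefix_scan(sorted_keys, prefix):
    """Range scan over sorted rowkeys, as an HBase prefix filter would do."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\xff")
    return sorted_keys[lo:hi]

print(prefix_scan(keys, "dev1#"))  # ['dev1#00007', 'dev1#00042', 'dev1#00300']
```

The zero-padding matters: without it, '7' would sort after '300' as a string, breaking the numeric order inside the prefix.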

Spark: Continuously reading data from Cassandra

I have gone through Reading from Cassandra using Spark Streaming and through tutorial-1 and tutorial-2 links.
Is it fair to say that Cassandra-Spark integration currently does not provide anything out of the box to continuously get the updates from Cassandra and stream them to other systems like HDFS?
By continuously, I mean getting only those rows in a table which have changed (inserted or updated) since the last fetch by Spark. If there are too many such rows, there should be an option to limit the number of rows and the subsequent spark fetch should begin from where it left off. At-least once guarantee is ok but exactly-once would be a huge welcome.
If it's not supported, one way to support it could be to have an auxiliary column updated_time in each Cassandra table that needs to be queried by Spark, and then use that column for queries. Or an auxiliary table per table that contains the ID and timestamp of the rows being changed. Has anyone tried this before?
I don't think Apache Cassandra has this functionality out of the box. Internally [for some period of time] it stores all operations on data in sequential manner, but it's per node and it gets compacted eventually (to save space). Frankly, Cassandra's (as most other DB's) promise is to provide latest view of data (which by itself can be quite tricky in distributed environment), but not full history of how data was changing.
So if you still want to have such info in Cassandra (and process it in Spark), you'll have to do some additional work yourself: design dedicated table(s) (or add synthetic columns), take care of partitioning, save offset to keep track of progress, etc.
Cassandra is OK for time-series data, but in your case I would consider just using a streaming solution (like Kafka) instead of reinventing one.
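The "auxiliary updated_time column plus saved offset" idea from the question could be sketched like this (plain Python in place of a real Cassandra query; the column names are the question's, everything else is illustrative):

```python
def fetch_changes(table, last_offset, limit):
    """Return up to `limit` rows changed after last_offset, plus the new offset."""
    changed = sorted((r for r in table if r["updated_time"] > last_offset),
                     key=lambda r: r["updated_time"])[:limit]
    # Persist new_offset somewhere durable so the next fetch resumes here.
    # Note: rows sharing a timestamp with the offset need extra care in practice.
    new_offset = changed[-1]["updated_time"] if changed else last_offset
    return changed, new_offset

table = [{"id": 1, "updated_time": 10}, {"id": 2, "updated_time": 20},
         {"id": 3, "updated_time": 30}]

batch, offset = fetch_changes(table, last_offset=0, limit=2)
print([r["id"] for r in batch], offset)  # [1, 2] 20
batch, offset = fetch_changes(table, offset, limit=2)
print([r["id"] for r in batch], offset)  # [3] 30
```

This gives at-least-once pull semantics with a bounded batch size, which is what the question asked for; exactly-once would additionally require deduplication downstream.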
I agree with what Ralkie stated but wanted to propose one more solution if you're tied to C* with this use case. This solution assumes you have full control over the schema and ingest as well. This is not a streaming solution though it could awkwardly be shoehorned into one.
Have you considered using a composite key composed of the timebucket along with murmur_hash_of_one_or_more_clustering_columns % some_int_designed_limit_row_width? In this way, you could set your timebuckets to 1 minute, 5 minutes, 1 hour, etc., depending on how "real-time" you need to analyze/archive your data. The murmur hash based on one or more of the clustering columns is needed to help locate data in the C* cluster (and is a terrible choice if you're often looking up specific clustering columns).
For example, take an IoT use case where sensors report in every minute and have some sensor reading that can be represented as an integer.
create table if not exists iottable (
    timebucket bigint,
    sensorbucket int,
    sensorid varchar,
    sensorvalue int,
    primary key ((timebucket, sensorbucket), sensorid)
) with caching = 'none'
and compaction = { 'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowedCompaction' };
Note the use of TimeWindowedCompaction. I'm not sure what version of C* you're using, but with the 2.x series I'd stay away from DateTieredCompaction. I cannot speak to how well it performs in 3.x. At any rate, you should test and benchmark extensively before settling on your schema and compaction strategy.
Also note that this schema could result in hotspotting as it is vulnerable to sensors that report more often than others. Again, not knowing the use case it's hard to provide a perfect solution -- it's just an example. If you don't care about ever reading C* for a specific sensor (or column), you don't have to use a clustering column at all and you can simply use a timeUUID or something random for the murmur hash bucketing.
Regardless of how you decide to partition the data, a schema like this would then allow you to use repartitionByCassandraReplica and joinWithCassandraTable to extract the data written during a given timebucket.
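The partition-key computation for such a schema might look like this in Python (a simple byte-sum hash stands in for murmur3, and the bucket width and row-width limit are illustrative):

```python
BUCKET_SECONDS = 300   # 5-minute timebuckets
ROW_WIDTH_LIMIT = 16   # some_int_designed_limit_row_width from the answer

def partition_key(sensor_id: str, epoch_seconds: int):
    # Round the timestamp down to the start of its bucket.
    timebucket = epoch_seconds - (epoch_seconds % BUCKET_SECONDS)
    # Stand-in for murmur_hash(clustering columns) % limit.
    sensorbucket = sum(sensor_id.encode()) % ROW_WIDTH_LIMIT
    return (timebucket, sensorbucket)

# Readings for the same sensor within the same 5-minute window
# share a partition, so one query fetches the whole window.
print(partition_key("sensor-42", 1_000_000))
print(partition_key("sensor-42", 1_000_100))
```

At read time you enumerate the (timebucket, sensorbucket) pairs for the window of interest, which is exactly what repartitionByCassandraReplica and joinWithCassandraTable exploit.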

PySpark Cassandra Connector efficiently querying across partition keys

I'm faced with the following problem using PySpark and dataframes with the cassandra-connector. My Cassandra data lake consists of metric measurements across (network) devices, and the entries are of type (device,interface,metric,time,value).
My cassandra table for the raw data has:
PRIMARY KEY ((device,interface,metric),time)
for supposedly efficient fetching of time ranges for a given measurement.
Now for reporting purposes, users can query any set of device/interface/metric combinations (ie give me a specific metric for all interfaces of a device). Now I know the list of each, so I'm not looking to do wildcard searches, but rather IN queries.
I'm using Spark 1.4, so I'm adding filters like the following to obtain dataframes on which to calculate min/max/percentiles/etc. of the recorded metric values.
metrics_raw_sub = metrics_raw\
    .filter(metrics_raw.device.inSet(device_list))\
    .filter(metrics_raw.interface.inSet(interface_list))\
    .filter(metrics_raw.metric.inSet(metric_list))
This isn't very efficient as these predicates do not get pushed down to CQL (only the last predicate can be an IN query), so I'm pulling in tons of data and filtering on the client side. (not good)
Why doesn't the cassandra-connector allow multiple IN predicates across partition columns? Doing this in a native CQL shell appears to work.
Another approach to my problem would be the following (and this yields efficient individual queries, as the predicates are pushed down to Cassandra):
for device in device_list:
    for interface in interface_list:
        metrics_raw_sub = metrics_raw\
            .filter(metrics_raw.device == device)\
            .filter(metrics_raw.interface == interface)\
            .filter(metrics_raw.metric.inSet(metric_list))
And then run the aggregation logic for each subquery; but I feel this largely serialises what should be a parallel computation across all requested device/interface/metric values. Can I batch the Cassandra queries so I can run my analytics on one large distributed dataframe?
Bottom line, I'm looking to do this very efficiently. If the turn-around times are short enough, we'll run these on demand. If not, we'll need to look into pre-computing the results and storing them in tables (which sacrifices the flexibility of custom time-range reporting).
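The batching idea can be simulated in plain Python: run one pushed-down query per (device, interface) pair and union the results, which yields the same rows as the multi-IN filter (lists stand in for DataFrames; on Spark the per-pair DataFrames would be unionAll'ed before aggregating):

```python
from itertools import product

# A tiny stand-in table: one metric value per (device, interface) pair.
rows = [{"device": d, "interface": i, "metric": "m1", "value": v}
        for v, (d, i) in enumerate(product(["d1", "d2", "d3"], ["eth0", "eth1"]))]

device_list, interface_list, metric_list = ["d1", "d2"], ["eth0"], ["m1"]

# One pushed-down query per (device, interface) pair, then a union.
unioned = [r for d, i in product(device_list, interface_list)
           for r in rows
           if r["device"] == d and r["interface"] == i and r["metric"] in metric_list]

# Equivalent client-side multi-IN filter over the full table.
in_filtered = [r for r in rows if r["device"] in device_list
               and r["interface"] in interface_list and r["metric"] in metric_list]

assert sorted(r["value"] for r in unioned) == sorted(r["value"] for r in in_filtered)
```

Since the union produces a single dataset, the aggregation afterwards stays one parallel job rather than one job per combination.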
Any insights would be much appreciated!!
Nik.

How to design a real time charging system together with spark and nosql databases

I would like to design a system that
Will read CDR (call data record) files and insert them into a NoSQL database. To achieve this, Spark Streaming with Cassandra as the NoSQL store looks promising, as the files will keep coming.
Will calculate a real-time price by rating the duration and the called number (or just the kilobytes, in the case of data) and store the chargeable amount so far for the current bill cycle. I need a NoSQL store into which I can both insert rated CDRs and update the chargeable amount so far for the current bill cycle for the MSISDN in each CDR.
In case rate plans are updated for a specific subscription, all CDRs of the current bill cycle that use that price plan need to be re-rated, and the total so far needs to be recalculated for all affected customers.
Notes:
MSISDNs are unique per subscription, with a one-to-one relation.
Within a month, one MSISDN can have up to 100,000 CDRs.
I have been going through the NoSQL databases; so far I am thinking of using Cassandra, but I am still not sure how to design the database to optimize for this business case.
Please also consider that while one CDR is being processed on one node, another CDR for the same MSISDN can be processed on another node at the same time, both nodes applying the above logic.
The question is indeed very broad - StackOverflow is meant to cover more specific technical questions, not to debate the architectural aspects of an entire system.
Apart from this, let me attempt to address some of the aspects of your questions:
a) Using streaming for CDR processing:
Spark Streaming is indeed a tool of choice for incoming CDRs, typically delivered over a message queueing system such as Kafka. It allows windowed operations, which come in handy when you need to calculate call charges over a set period (hours, days, etc.). You can very easily combine existing static records, such as price plans from other databases, with your incoming CDRs in windowed operations. All of that in a robust and expressive API.
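The windowed charging step can be sketched in plain Python: bucket incoming CDRs into fixed time windows and sum the chargeable units per MSISDN (the window size and tariff are illustrative; on Spark Streaming this would be a windowed reduceByKey):

```python
from collections import defaultdict

WINDOW_SECONDS = 3600   # hourly charging windows
RATE_PER_SECOND = 2     # illustrative tariff, in integer price units

cdrs = [  # (msisdn, start_epoch, duration_seconds)
    ("4915770000001", 1_000_100, 60),
    ("4915770000001", 1_000_500, 120),
    ("4915770000002", 1_003_700, 30),
]

# Aggregate charges per (msisdn, window start); defaultdict stands in
# for the stateful per-key accumulation a streaming job would keep.
charges = defaultdict(int)
for msisdn, start, duration in cdrs:
    window = start - (start % WINDOW_SECONDS)
    charges[(msisdn, window)] += duration * RATE_PER_SECOND

for key, amount in sorted(charges.items()):
    print(key, amount)
```

Re-rating after a plan change then amounts to replaying the affected windows' CDRs with the new rate, which is the batch side of the lambda setup described below.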
b) using Cassandra as a store
Cassandra has excellent scaling capabilities with instantaneous row access - for that, it's an absolute killer. However, in a TelCo industry setting, I would seriously question using it for anything other than MSISDN lookups and credit checks. Cassandra is essentially a columnar KV store, and trying to keep multi-dimensional, essentially relational records such as price plans and contracts in it will give you lots of headaches. I would suggest storing your data in different stores, depending on the use case. These could be:
CDR raw records in HDFS -> CDRs can be plentiful, and if you need to reprocess them, collecting them from HDFS will be more efficient
Bill summaries in Cassandra -> the itemized bill summaries is the result of CDR as initially processed by Spark Streaming. These are essentially columnar and can be perfectly stored in Cassandra
MSISDN and Credit information -> as mentioned above, that is also a perfect use case for Cassandra
price plans -> these are multi-dimensional, more document-oriented, and should be stored in a database that supports such structures. You could perfectly well use Postgres with JSON for that, as you wouldn't expect more than a handful of plans.
To conclude this, you're actually looking at a classic lambda use case with Spark Streaming for immediate processing of incoming CDRs, and batch processing with regular Spark on HDFS for post processing, for instance when you're recalculating CDR costs after plan changes.
