How to design a real-time charging system with Spark and NoSQL databases

I would like to design a system that:
Reads CDR (call data record) files and inserts them into a NoSQL database. For this, Spark Streaming with Cassandra as the NoSQL store looks promising, since the files will keep coming.
Calculates the real-time price by rating the duration and the called number (or just the kilobytes, in the case of data) and stores the total chargeable amount so far for the current bill cycle. I need a NoSQL store into which I can both insert rated CDRs and update the running chargeable total for the current bill cycle of the MSISDN in each CDR.
Re-rates, whenever the rate plan of a subscription is updated, all CDRs of the current bill cycle that use that price plan, and recalculates the running totals for all affected customers.
Notes:
MSISDNs are unique for each subscription, with a one-to-one relation.
One MSISDN can have up to 100,000 CDRs within a month.
I have been going through the NoSQL databases; so far I am thinking of using Cassandra, but I am still not sure how to design the database to optimize for this business case.
Please also consider that while one CDR is being processed on one node, another CDR for the same MSISDN can be processed on another node at the same time, with both nodes applying the logic above.

The question is indeed very broad - Stack Overflow is meant to cover specific technical questions, not to debate architectural aspects of an entire system.
That said, let me attempt to address some aspects of your question:
a) Using streaming for CDR processing:
Spark Streaming is indeed a tool of choice for incoming CDRs, typically delivered over a message queueing system such as Kafka. It allows windowed operations, which come in handy when you need to calculate call charges over a set period (hours, days, etc.). You can very easily combine existing static records, such as price plans from other databases, with your incoming CDRs in windowed operations. All of that comes with a robust and expressive API.
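To make this concrete, here is a minimal sketch using today's Structured Streaming API. The Kafka topic, the JDBC source for the rate plans, and all column names (msisdn, plan_id, duration_sec, price_per_sec, event_time) are assumptions invented for the example, not something from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object CdrRating {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("cdr-rating").getOrCreate()
    import spark.implicits._

    // Assumed layout of a CDR message.
    val cdrSchema = StructType(Seq(
      StructField("msisdn", StringType),
      StructField("plan_id", StringType),
      StructField("duration_sec", LongType),
      StructField("event_time", TimestampType)))

    // Static rating records, e.g. price plans kept in Postgres.
    val ratePlans = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db/billing")   // hypothetical database
      .option("dbtable", "rate_plans")                 // hypothetical table
      .load()

    // Incoming CDRs, one JSON record per Kafka message.
    val cdrs = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "cdrs")
      .load()
      .select(from_json($"value".cast("string"), cdrSchema).as("cdr"))
      .select("cdr.*")

    // Rate each CDR against its plan (a stream-static join), then keep a
    // windowed running charge per MSISDN.
    val charges = cdrs
      .join(ratePlans, Seq("plan_id"))
      .withColumn("charge", $"duration_sec" * $"price_per_sec")
      .withWatermark("event_time", "10 minutes")
      .groupBy($"msisdn", window($"event_time", "1 hour"))
      .agg(sum($"charge").as("charged_so_far"))

    charges.writeStream
      .outputMode("update")
      .format("console")   // swap for a Cassandra sink in a real deployment
      .start()
      .awaitTermination()
  }
}
```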
b) using Cassandra as a store
Cassandra has excellent scaling capabilities with near-instantaneous row access - for that, it's an absolute killer. However, in a telco industry setting, I would seriously question using it for anything other than MSISDN lookups and credit checks. Cassandra is essentially a columnar KV store, and trying to store multi-dimensional, essentially relational records such as price plans, contracts and the like will give you lots of headaches. I would suggest storing your data in different stores, depending on the use case. These could be:
CDR raw records in HDFS -> CDRs can be plentiful, and if you need to reprocess them, collecting them from HDFS will be more efficient.
Bill summaries in Cassandra -> itemized bill summaries are the result of the CDRs as initially processed by Spark Streaming. These are essentially columnar and can be perfectly stored in Cassandra.
MSISDN and credit information in Cassandra -> as mentioned above, this is also a perfect use case for Cassandra.
Price plans in a document-friendly store -> these are multi-dimensional, more document-oriented, and should be stored in a database that supports such structures. You can perfectly use Postgres with JSON for that, as you wouldn't expect more than a handful of plans.
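As a rough illustration of the Cassandra side of this split, here is a sketch of what the bill-summary tables could look like; the keyspace, table, and column names are all invented for the example. The counter table also addresses the concurrency note in the question: two nodes rating CDRs for the same MSISDN can both issue increments without a read-modify-write race.

```scala
import com.datastax.oss.driver.api.core.CqlSession

object BillingSchema {
  def main(args: Array[String]): Unit = {
    val session = CqlSession.builder().build()   // defaults to localhost:9042

    session.execute(
      """CREATE KEYSPACE IF NOT EXISTS billing
        | WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)

    // Itemized bill summaries: one partition per MSISDN and bill cycle,
    // so a cycle's rated CDRs sit together, sorted by time.
    session.execute(
      """CREATE TABLE IF NOT EXISTS billing.bill_summary (
        |  msisdn     text,
        |  bill_cycle text,        -- e.g. '2019-05'
        |  cdr_time   timestamp,
        |  cdr_id     uuid,
        |  charge     decimal,
        |  PRIMARY KEY ((msisdn, bill_cycle), cdr_time, cdr_id)
        |)""".stripMargin)

    // Running chargeable total, kept as a counter (in the smallest currency
    // unit) so that concurrent writers can increment it safely.
    session.execute(
      """CREATE TABLE IF NOT EXISTS billing.charged_so_far (
        |  msisdn      text,
        |  bill_cycle  text,
        |  total_units counter,
        |  PRIMARY KEY ((msisdn, bill_cycle))
        |)""".stripMargin)

    session.close()
  }
}
```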
To conclude: you're actually looking at a classic lambda architecture use case, with Spark Streaming for immediate processing of incoming CDRs and batch processing with regular Spark on HDFS for post-processing, for instance when recalculating CDR costs after plan changes.
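For the batch leg of that lambda setup, a re-rating job could look roughly like the following; the HDFS path, the JDBC source, and the column and table names are assumptions carried over from the sketches above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object RerateBillCycle {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rerate").getOrCreate()

    // Raw CDRs of the current bill cycle, as landed by the streaming job.
    val cdrs = spark.read.parquet("hdfs:///cdrs/raw/cycle=2019-05/")

    // The updated price plans.
    val plans = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db/billing")
      .option("dbtable", "rate_plans")
      .load()

    // Re-rate every CDR against the new plans and rebuild per-MSISDN totals.
    val totals = cdrs
      .join(plans, Seq("plan_id"))
      .withColumn("charge", col("duration_sec") * col("price_per_sec"))
      .groupBy("msisdn")
      .agg(sum("charge").as("charged_so_far"))

    // Push the recomputed totals back to Cassandra via the Spark Cassandra
    // connector (table name invented for the example).
    totals.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "billing", "table" -> "bill_totals"))
      .mode("append")
      .save()
  }
}
```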

Related

Spark structured streaming real-time aggregation

Is it possible to output aggregation data on every trigger, before the aggregation time window is over?
Context: I'm developing an application that reads data from a Kafka topic, processes the data, aggregates it over a 1-hour window, and outputs to S3. However, the Spark application understandably outputs the aggregated data to S3 only at the end of a given hour window.
The problem is that the end users of the aggregated data in S3 only get a semi-real-time view, since they are always one hour behind, waiting for the next aggregation to be output by the Spark application.
Reducing the aggregation time window to something smaller than an hour would certainly help, but would generate a lot more data.
What could be done to enable real-time aggregation, as I call it, using minimal resources?
This is an interesting one and I do have a proposal, but I'm not sure it would really fit your "minimal resources" criterion. I'll describe the solution anyway...
If the end goal is to enable users to query data in real time (or faster analytics, in other words), then one way to achieve this is to introduce a database into your architecture that can handle fast inserts/updates - either a key-value store or a column-oriented database.
The idea is simple - keep ingesting data into the first database, and keep offloading the data into S3 after a certain time, i.e. either an hour or a day depending on your requirements. You could then register the metadata of both of these storage layers in a metadata layer (such as AWS Glue) - this may not always be necessary if you don't need a persistent metastore. On top of this, you could use something like Presto to query across both of these stores, which would also let you optimise your storage across the two data stores.
You'll obviously need to build the process that drops/deletes the data partitions from the store you are streaming into, and that moves the data to S3; a sketch is given below.
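Here is a rough sketch of such an offload job, assuming Cassandra as the hot store and Parquet on S3 as the cold tier; the keyspace, table, column, and bucket names are all placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object OffloadToS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("tiered-offload").getOrCreate()

    // Read rows that have aged out of the hot window (one day, here).
    val aged = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "metrics", "table" -> "events"))
      .load()
      .where(col("event_date") < date_sub(current_date(), 1))

    // Append them to the cold tier as date-partitioned Parquet, which Presto
    // (with the table registered in Glue) can query next to the hot store.
    aged.write
      .mode("append")
      .partitionBy("event_date")
      .parquet("s3a://my-bucket/events/")

    // Removing the offloaded rows from the hot store is store-specific; with
    // Cassandra the simplest option is a write-time TTL equal to the hot
    // window, so aged rows expire on their own.
    spark.stop()
  }
}
```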
This model is referred to as a tiered (or hierarchical) storage model with a sliding window pattern - see the reference article from Cloudera.
Hope this helps!

Is an RDBMS with redundancy as good as NoSQL DBs?

I am reading about NoSQL DBs (specifically Cassandra), and it says that Cassandra is fast for writes and that queries are fast as well. Schema design is done based on queries rather than on the data itself - for example, you have queries like in this example.
So I have a question: suppose I design my RDBMS schema in a similar way to Cassandra's and ensure that no joins are required for queries. Will I still get any significant performance gain by using Cassandra (NoSQL DBs)?
There is no exact answer, but a few points:
JOIN is just one of many things - Cassandra stores the data physically based on the partition key and hence makes reads by partition as fast as possible (see the sketch after this list).
On the performance side - it's not about performance at the beginning, but about keeping performance consistent over a period of time. Say, for example, you have a time-series-like requirement where data is inserted every second: RDBMS performance will usually degrade as the data grows, and it is not easy to keep the indexes and statistics up to date, while Cassandra fits the time-series pattern better, and as the data grows it is easy to scale out by adding nodes.
On write performance - Cassandra's write path itself is different and is designed for fast ingestion (the complicated parts, like merging SSTables and compaction, happen in the background without affecting the actual write).
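To make the first point concrete, here is a small sketch of query-first schema design in Cassandra; the table and columns are invented for the example:

```scala
import com.datastax.oss.driver.api.core.CqlSession

object QueryFirstSchema {
  def main(args: Array[String]): Unit = {
    val session = CqlSession.builder().build()

    session.execute(
      """CREATE KEYSPACE IF NOT EXISTS shop
        | WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)

    // Where an RDBMS would join orders to customers at read time, Cassandra
    // keeps one denormalized table per query, written out in advance.
    session.execute(
      """CREATE TABLE IF NOT EXISTS shop.orders_by_customer (
        |  customer_id   text,
        |  order_time    timestamp,
        |  order_id      uuid,
        |  customer_name text,      -- duplicated from the customer record
        |  total         decimal,
        |  PRIMARY KEY ((customer_id), order_time, order_id)
        |) WITH CLUSTERING ORDER BY (order_time DESC, order_id ASC)""".stripMargin)

    // The partition key places all of one customer's orders together on
    // disk, so this read touches one partition however large the table grows.
    session.execute(
      "SELECT * FROM shop.orders_by_customer WHERE customer_id = 'c-9' LIMIT 20")

    session.close()
  }
}
```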
In short - you need to review the business case and make a decision.

What time-series database to select for a large number of records?

I have ended up in a scenario where I have about 100,000 input records per second to store. The records are time-series data.
I need to continuously run aggregations, other analytics, and some machine learning algorithms over the data. Performance is the deciding factor here, as I am looking for near-real-time results.
What would you recommend as the database engine?
Take a look at the ClickHouse analytical database. It can accept millions of rows per second and scan billions of rows per second on a single computer. It scales horizontally to multiple nodes and fits time-series workloads.
If you still need a time-series database, then try VictoriaMetrics. It is built on ClickHouse ideas, so it is fast and resource-efficient.
I am adding my own solution...
ClickHouse is definitely a killer option. But for a new project I am now evaluating the open-source GPU database OmniSci. Its open-source version is limited to a single GPU node (up to 16 GPU devices - with OEM Tesla cards at 64 GB per device you can get 1 TB of VRAM, though of course not as cheap as ClickHouse). It is simply a SQL database on steroids (a JDBC driver exists) with a Kafka data source.
OmniSci also has a cross-dashboarding solution, which requires a licence, but it gives you real-time dashboarding over, let's say, 20-50 billion time-series records (8-16 GPUs) and multi-dashboard real-time analytics without any kind of pre-aggregation required.
But it will cost money...
If you want to go purely open source, my second candidate is NVIDIA's RAPIDS framework, which implements cuDF (a CUDA DataFrame, a Spark-like data structure). You can use it to maintain your data window (append new, delete obsolete), together with the cuxfilter solution, which is similar to OmniSci but more of a framework; with a skilled frontend coder you can achieve something very close to OmniSci.
Of course you can go and implement your own solution on top of Cassandra with an appropriate data model for your use case. That may get you the best results, tailored to your needs.
You could look at KairosDB (https://kairosdb.github.io/), which is a time-series database on top of Apache Cassandra; I got 50k writes per second on a medium-sized single (but bare-metal) node.
It is quite well documented (https://kairosdb.github.io/docs/build/html/CassandraSchema.html) and it has aggregators out of the box (https://kairosdb.github.io/docs/build/html/restapi/QueryMetrics.html).
OpenTSDB was slower in my tests. InfluxDB looks promising, but I have no experience with it myself: https://github.com/influxdata/influxdb

Perform queries over the time-series stream of data

I'm trying to design the architecture of my streaming application and choose the right tools for the job.
This is how it works currently:
Messages from the "application-producer" part have the form of (address_of_sensor, timestamp, content) tuples.
I've already implemented all the functionality before Kafka, and now I've encountered a major flaw in the design. In the "Spark Streaming" part, the consolidated stream of messages is translated into a stream of events. The problem is that events are for the most part composite - they consist of multiple messages that occurred at the same time at different sensors.
I can't rely on "time of arrival to Kafka" as a means to detect simultaneity. So I have to somehow sort the messages in Kafka before extracting them with Spark - or, more precisely, make queries over the Kafka messages.
Maybe Cassandra is the right replacement for Kafka here? I have a really simple data model, and only two possible types of queries to perform: query by address, and range query by timestamp. Maybe this is the right choice?
Does somebody have any numbers on Cassandra's throughput?
If you want to run queries on your time series, Cassandra may be the best fit - it is very write-optimized, and you can build "wide" rows for your series. It is possible to take slices of your wide rows, so you can select a time range with only one query (see the sketch after this answer).
On the other hand, Kafka can be considered a raw data flow - you don't have queries, only recently produced data. In order to collect data based on some key in the same partition, you have to select this key carefully. All data within the same partition is time-sorted.
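A minimal sketch of that wide-row layout, covering both query shapes from the question; the names are invented for the example:

```scala
import com.datastax.oss.driver.api.core.CqlSession

object SensorSlices {
  def main(args: Array[String]): Unit = {
    val session = CqlSession.builder().build()

    session.execute(
      """CREATE KEYSPACE IF NOT EXISTS demo
        | WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)

    // One wide row per sensor address: the address is the partition key and
    // the timestamp the clustering column, so a time range is a single slice.
    session.execute(
      """CREATE TABLE IF NOT EXISTS demo.messages (
        |  address text,
        |  ts      timestamp,
        |  content text,
        |  PRIMARY KEY ((address), ts)
        |)""".stripMargin)

    // Query by address...
    session.execute("SELECT * FROM demo.messages WHERE address = 's-17'")

    // ...and a range query by timestamp within one address (one slice).
    session.execute(
      """SELECT * FROM demo.messages
        | WHERE address = 's-17'
        |   AND ts >= '2019-01-01 00:00:00' AND ts < '2019-01-01 01:00:00'""".stripMargin)

    session.close()
  }
}
```

Note that a timestamp range without the address has no partition key and would require a cluster-wide scan, which is why the next answer makes the address part of the primary key.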
Range queries on timestamp are the classic use case for Cassandra. If you need address-based queries as well, you would have to make the address part of the primary key (typically the partition key, with the timestamp as a clustering column). As far as Cassandra throughput is concerned, if you can invest in proper performance analysis of the Cassandra cluster, you can achieve very high write throughput. But having used Spark SQL, the Cassandra driver, and the Spark Cassandra connector, I found they don't really give high query throughput until you have a big cluster with a high-CPU configuration; it does not work well with small datasets.
Kafka should not be used as a data source for queries; it is more of a commit log.

Hive/HDFS for real-time storage of sensor data on a massive scale?

I am evaluating sensor data collection systems with the following requirements:
1 million endpoints sending in 100 bytes of data every minute (as a time series).
Basically millions of small writes to the storage.
This data is write-once, so basically it never gets updated.
Access requirements
a. Full data for a user needs to be accessed periodically (less frequent)
b. Partial data for a user needs to be accessed periodically (more frequent). E.g. I need sensor data collected over the last hour/day/week/month for analysis/reporting.
I have started looking at Hive/HDFS as an option. Can someone comment on the applicability of Hive in such a use case? I am concerned that while the distributed storage needs would work, it seems more suited to data warehousing applications than to real-time data collection/storage.
Do HBase/Cassandra make more sense in this scenario?
I think HBase can be a good option for you. In fact, there's already an open-source implementation in HBase which solves a similar problem that you might want to use: take a look at OpenTSDB. Here's a short excerpt from their blurb:
OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable. Thanks to HBase's scalability, OpenTSDB allows you to collect many thousands of metrics from thousands of hosts and applications, at a high rate (every few seconds). OpenTSDB will never delete or downsample data and can easily store billions of data points. As a matter of fact, StumbleUpon uses it to keep track of hundreds of thousands of time series and collects over 600 million data points per day in their main production datacenter.
There are actually quite a few people collecting sensor data in a time-series fashion with Cassandra. It's a very good fit. I recommend you read this article on basic time series in Cassandra for an idea of what your data model would look like.
Writes in Cassandra are extremely cheap, so even a moderately sized cluster could easily handle one million writes per minute.
Both of your read queries could be answered very efficiently. For the second type of query, where you're reading data for a time slice for a single sensor, you would end up reading a contiguous slice from a single row; this should take about 10 ms even for a completely cold read. For the first type of query, you would simply run several of the per-sensor queries in parallel. Assuming you store a basic map of users to sensor IDs, you would look up all of the sensor IDs for a user with one query, and then a second query would fetch the data for all of those sensors (although you might break this query up if the number of sensors is high). A sketch of this layout follows.
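Here is a hedged sketch of that layout: a user-to-sensors lookup table plus a per-sensor series table, with the two reads described above. All names, and the day-bucketing of the partitions, are assumptions for the example:

```scala
import com.datastax.oss.driver.api.core.CqlSession
import scala.jdk.CollectionConverters._

object SensorReads {
  def main(args: Array[String]): Unit = {
    val session = CqlSession.builder().build()

    session.execute(
      """CREATE KEYSPACE IF NOT EXISTS demo
        | WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)

    // Map of users to their sensors.
    session.execute(
      """CREATE TABLE IF NOT EXISTS demo.user_sensors (
        |  user_id   text,
        |  sensor_id text,
        |  PRIMARY KEY ((user_id), sensor_id)
        |)""".stripMargin)

    // Per-sensor series, bucketed by day to keep partitions bounded.
    session.execute(
      """CREATE TABLE IF NOT EXISTS demo.sensor_data (
        |  sensor_id text,
        |  day       date,
        |  ts        timestamp,
        |  payload   blob,
        |  PRIMARY KEY ((sensor_id, day), ts)
        |)""".stripMargin)

    // Query 1: all sensor IDs for a user - a single-partition read.
    val sensors = session
      .execute("SELECT sensor_id FROM demo.user_sensors WHERE user_id = 'u-1'")
      .all().asScala.map(_.getString("sensor_id"))

    // Query 2: one contiguous slice per sensor; in production these would be
    // prepared statements issued in parallel rather than a sequential loop.
    for (s <- sensors) {
      session.execute(
        s"SELECT ts, payload FROM demo.sensor_data WHERE sensor_id = '$s' AND day = '2019-01-01'")
    }
    session.close()
  }
}
```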
Hive and HDFS don't really make sense when you're talking about real-time queries, as they're more suited for long-running batch jobs.
