Strategy for Handling Historical Time-Series Data in Cassandra - apache-spark

I'm working with an insane volume of time-series data, and I have two Kafka topics:
1) Real-time time-series data from moving vehicles, reported every 5 seconds.
2) Historical time-series data from about 10% of vehicles: when a vehicle travels in a remote area, its data is sent only once it comes back into network coverage, which may be hours, days, or weeks later.
So my Cassandra table looks somewhat like this:
CREATE TABLE locationinfo (
imei text,
date text,
entrydt timestamp,
gpsdt timestamp,
lastgpsdt timestamp,
latitude text,
longitude text,
odo int,
speed int,
PRIMARY KEY ((imei, date), gpsdt)
) WITH CLUSTERING ORDER BY (gpsdt ASC)
I'm using Spark Streaming to fetch data from Kafka and insert it into Cassandra; the clustering key here is gpsdt. Whenever historical data arrives from Kafka, a lot of reshuffling happens in the table, given Cassandra's architecture: data is stored in sequential (clustering) order within each partition, and the late historical records have to be slotted in between the existing rows. What happens is that, after a certain period of time, the Spark Streaming application hangs. After a lot of searching I suspect there may be a problem with my table strategy, so what if I create a table schema like this:
CREATE TABLE locationinfo (
imei text,
date text,
entrydt timestamp,
gpsdt timestamp,
lastgpsdt timestamp,
latitude text,
longitude text,
odo int,
speed int,
PRIMARY KEY ((imei, date), entrydt)
) WITH CLUSTERING ORDER BY (entrydt ASC)
Here the order is defined by insertion time, so whenever historical data arrives it is simply appended at the end and there is no reshuffling overhead. But in this case I won't be able to make range queries on gpsdt. So I would like to know the best strategy to handle this scenario. My load from Kafka is more than 2k records/sec.
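For reference, the kind of range query I need on gpsdt looks like this (the imei, date, and time window are just illustrative values):
SELECT * FROM locationinfo
WHERE imei = '0123456789'
AND date = '2018-03-01'
AND gpsdt >= '2018-03-01 00:00:00'
AND gpsdt < '2018-03-01 06:00:00';
With the second schema (clustered on entrydt), this kind of query on gpsdt is no longer served by the clustering order.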

Related


Cassandra Data Modeling Query
Hello,
The data model I am working on is below, with different tables for the same data set to satisfy different kinds of queries. The data mainly stores event data for campaigns sent out on multiple channels like email, web, mobile app, SMS, etc. Events can include page visits, email opens, link clicks, etc. for different subscribers.
Table 1:
(enterprise_id int, domain_id text, campaign_id int, event_category text, event_action text, datetime timestamp, subscription_id text, event_label text, ........) (many more columns not part of primary key)
PRIMARY KEY ((enterprise_id, campaign_id), domain_id, event_category, event_action, datetime, subscription_id)
CLUSTERING ORDER BY (domain_id DESC, event_category DESC, event_action DESC, datetime DESC, subscription_id DESC)
Keys and Data size for Table 1:
I have the partition key as enterprise_id + campaign_id. Each enterprise can have several campaigns, and the datastore may have data for a few hundred campaigns. Each campaign can have up to 2-3 million records. Hence there may be around 3000 partitions across 100 enterprises, each partition having 2-3 million records.
Cassandra Queries: Queries always use the partition key plus clustering columns up to and including the datetime field. The subscription_id is included in the primary key to keep each record unique, as we can have multiple records with the same values for the rest of the key columns. enterprise_id + campaign_id is always available as a filter in the queries.
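For example, a typical query against Table 1 has this shape (the table name and literal values are just placeholders):
SELECT * FROM table1
WHERE enterprise_id = 1001 AND campaign_id = 42
AND domain_id = 'example.com'
AND event_category = 'email'
AND event_action = 'open'
AND datetime >= '2018-01-01' AND datetime < '2018-02-01';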
Table 2:
(enterprise_id int, domain_id text, event_category text, event_action text, datetime timestamp, subscription_id text, event_label text, campaign_id int........) (many more columns not part of primary key)
PRIMARY KEY (enterprise_id, domain_id, event_category, event_action, datetime, subscription_id)
CLUSTERING ORDER BY (domain_id DESC, event_category DESC, event_action DESC, datetime DESC, subscription_id DESC)
Keys and Data size for Table 2: The partition key is enterprise_id only. Each enterprise can have several campaigns, maybe a few hundred. Each campaign can have up to 2-3 million records. In this case the partition is quite big, with data for all campaigns in a single partition; it can have up to 800-900 million entries.
Cassandra Queries: Queries always use the partition key plus clustering columns up to datetime. The subscription_id is included in the primary key to keep each record unique, as we can have multiple records with the same values for the rest of the key columns. In this case, data has to be queried across campaigns and we may not have campaign_id as a filter in the queries.
Table 3:
(enterprise_id int, subscription_id text, domain_id text, event_category text, event_action text, datetime timestamp, event_label text, campaign_id int........) (many more columns not part of primary key)
PRIMARY KEY (enterprise_id, subscription_id, domain_id, event_category, event_action, datetime)
CLUSTERING ORDER BY (subscription_id DESC, domain_id DESC, event_category DESC, event_action DESC, datetime DESC)
Keys and Data size for Table 3: The partition key is enterprise_id. Each enterprise can have several campaigns, maybe a few hundred. Each campaign can have up to 2-3 million records. In this case the partition is quite big, with data for all campaigns in a single partition; it can have up to 800-900 million entries.
Cassandra Queries: Queries always use the partition key plus subscription_id only. I should be able to query directly on enterprise_id + subscription_id.
My Queries:
Size of data in each partition: With Table 2 and Table 3 I may end up with more than 800-900 million rows per partition. From what I have read, it is not OK to have so many entries per partition. How can I achieve my use case in this scenario? Even if I create multiple partitions based on something like a week number (1-52 in a year), the query would need to hit all partitions and end up using an IN clause with all week numbers, which is as good as scanning all the data.
Is it OK to have multiple tables with the same partition key and different primary keys with a different clustering order? For example, in Table 2 and Table 3 the hash will be on enterprise_id and will lead to the same node. However, only the clustering key order has changed, which lets me query directly on the required key. Will the data be in different physical partitions for Table 2 and Table 3 in such a scenario? Or, if it maps to the same partition, how will Cassandra internally distinguish between the two tables?
Is it OK to use ALLOW FILTERING if I specify the partition key? For example, I could avoid the need to create Table 3 and use Table 2 to query on subscription_id directly if I use ALLOW FILTERING on Table 2. What would the impact be?
First of all, please ask only one question per post. Given the length and detail required for your answers, this post is unlikely to provide long-term value for future users.
As per my reading it is not ok to have so many entries per partition. How can I achieve my use case in this scenario?
Unfortunately, if partitioning on a time component will not work, then you'll have to find some other column to partition the data by. I've seen rows-per-partition work OK in the range of 20k to 50k; most of those use cases at the higher end still had small partitions, because the rows were small. It looks like your model has many columns, so I'd be curious about the average partition size. Essentially, find a column to partition on which keeps your partition sizes in the 1MB to 10MB range.
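As one hypothetical sketch (only workable if domain_id is always known at query time), Table 2 could pull another column into the partition key to cap partition size:
PRIMARY KEY ((enterprise_id, domain_id), event_category, event_action, datetime, subscription_id)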
Is it ok to have multiple tables with same partition key and different primary keys with Clustering order change?
Yes, this is perfectly fine.
Will the data be in different physical partitions for Table2 and Table3 in such a scenario? Or if it maps to same partition number how will cassandra internally distinguish between the two tables?
The partition key is hashed into a number ranging from -2^63 to +2^63. That number is then compared to the token ranges assigned to the nodes, and the query is sent to the node responsible for that range. So all the partition key does is determine which node is responsible for the data.
The tables have their data files written to different directories, based on table name. So Cassandra distinguishes between the tables by the table name provided in the query. Nothing you need to worry about.
Is it ok to use ALLOW FILTERING if I specify the partition key?
I would still recommend against it if you're concerned about performance. But the good thing is that using the ALLOW FILTERING directive while specifying a full partition key will indeed prevent Cassandra from reading multiple nodes to build the result set. So that should be OK. The only drawback here is that Cassandra stores/reads data from disk in the defined CLUSTERING ORDER, and using ALLOW FILTERING obviously complicates that process (forcing random reads vs. sequential reads).
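For example, querying Table 2 for a single subscription while providing the full partition key would look something like this (the table name and values are illustrative):
SELECT * FROM table2
WHERE enterprise_id = 1001
AND subscription_id = 'sub-123'
ALLOW FILTERING;
Cassandra can route this to the replicas owning enterprise_id = 1001, but it still has to scan that partition's rows to filter on subscription_id.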

How to handle the design issue of one Cassandra table's output being used as another Cassandra table's input?

I have two tables as below:
CREATE TABLE model_vals (
model_id int,
data_item_code text,
date date,
data_item text,
pre_cal1 text,
pre_cal2 text,
pre_cal3 text,
pre_cal4 text,
pre_cal5 text,
pre_cal6 text,
PRIMARY KEY (( model_id, data_item ), date)
) WITH CLUSTERING ORDER BY ( date DESC )
CREATE TABLE prapre_calulated_vals (
id int,
precal_code text,
date date,
precal_item text,
pre_cal1 text,
pre_cal2 text,
pre_cal3 text,
pre_cal4 text,
pre_cal5 text,
pre_cal6 text,
PRIMARY KEY (( id, precal_item ), date)
) WITH CLUSTERING ORDER BY ( date DESC )
After processing the input data from Kafka using Spark SQL, the result data is inserted into the first (model_vals) C* table, which in turn serves some web-service endpoints.
Another piece of business logic needs the data from the first (model_vals) C* table, processes it, and populates the results into the second (prapre_calulated_vals) C* table.
For the web-service endpoints, the end user can pass the required WHERE conditions and get the data from the first (model_vals) C* table.
But for the further processing I need to read the entire first (model_vals) C* table, process the data, do another set of calculations, and populate the second (prapre_calulated_vals) C* table.
The first (model_vals) C* table has millions of records, so we can't load the entire table at once to process it.
How do I handle this scenario in C*? What alternatives do I have?
You have several options depending on the complexity of what you need done. In general, it sounds like you need some sort of streaming framework that, while writing new data to your records, also runs some business logic and writes to a second table.
Some technologies that come to mind are:
Spark Streaming
Flink
Apex
All of these technologies have connectors for Cassandra that can efficiently read both entire tables and portions of tables for joining with new data. Of course, this will be slower than aggregation techniques on flat files or small requests for tiny amounts of data.
If you don't need a streaming approach, since you are already using Spark, I would suggest using a subsequent Spark SQL query to populate your final table.

Fetch a specific number of time-series records from Cassandra

Obviously, when time-series data relates to some natural partition key like a sensor id, that column can be used as the partition key. But what do we do if we are interested in a global view and there is no natural candidate for the partition key? If we model the schema like this:
CREATE TABLE my_data(
year smallint,
day smallint,
date timestamp,
value text,
PRIMARY KEY ((year, day), date)
) WITH CLUSTERING ORDER BY (date DESC);
It is (probably) going to work just fine for most cases, provided we know which year and day to fetch.
But what if we don't care what day it is and just want to see the 50 most recent items? What if we then want to see the next 50 items? Is there a way to do this in Cassandra? What is the recommended approach?
Keep a second table of the year/days. When reading, you can query it first. When adding to my_data, update that table as well, but keep a cache of the days already inserted so each app only attempts the insert once per day. For example, adding an extra key so you can have multiple streams rather than a single table per time series:
CREATE TABLE my_data (
key blob,
year smallint,
day smallint,
date timestamp,
value text,
PRIMARY KEY ((key, year, day), date)
) WITH CLUSTERING ORDER BY (date DESC);
CREATE TABLE my_data_keys (
key blob,
year smallint,
day smallint,
PRIMARY KEY ((key), year, day)
)
For inserts:
INSERT INTO my_data_keys (key, year, day) VALUES (0x01, 1, 2)
INSERT INTO my_data ...
Then keep an in-memory set somewhere recording which key/year/day combinations you have already stored, so you don't need to insert them every time. To read the most recent:
SELECT year, day FROM my_data_keys WHERE key = 0x01;
The driver returns an iterator; for each element in it, query my_data until 50 records have been collected.
If inserts are frequent enough you can just work backwards from "today", issuing queries until you get 50 events. If the data is sparse, though, that can mean a lot of wasted reads, and the extra table works better.
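For example, reading one day's worth of data capped at 50 rows would look like this (the key and date values are illustrative); repeat for each year/day returned from my_data_keys until 50 rows have been collected:
SELECT date, value FROM my_data
WHERE key = 0x01 AND year = 2018 AND day = 200
LIMIT 50;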

How to model an inbox

How would I go about modelling the data if I have a web app for messaging, and I expect the user to either see all their messages ordered by date, or see the messages exchanged with a specific contact, again ordered by date?
Should I have two tables, called "global_inbox" and "contacts_inbox" where I would add each message to both?
For example:
CREATE TABLE global_inbox(user_id int, timestamp timestamp,
message text, PRIMARY KEY(user_id, timestamp));
CREATE TABLE inbox(user_id int, contact_id int,
timestamp timestamp, message text,
PRIMARY KEY(user_id, contact_id, timestamp));
This means that every message would be copied 4 times, 2 for the sender and 2 for the receiver. Does that sound reasonable?
Yes, it's reasonable.
You need some modifications.
Inbox table: If a user has many contacts and every contact sends messages, then a huge amount of data will be inserted into a single partition (user_id). So add contact_id to the partition key.
Updated Schema :
CREATE TABLE inbox (
user_id int,
contact_id int,
timestamp timestamp,
message text,
PRIMARY KEY((user_id, contact_id), timestamp)
);
global_inbox: Even though it's the global inbox, a huge amount of data can be inserted into a single partition (user_id). So add more columns to the partition key for better distribution.
Updated Schema :
CREATE TABLE global_inbox (
user_id int,
year int,
month int,
timestamp timestamp,
message text,
PRIMARY KEY((user_id,year,month), timestamp)
);
Here you can also add week to the partition key if you get a huge amount of data in a single partition within a week, or remove month from the partition key if you think not much data will be inserted in a year.
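With these schemas, the two read paths would look something like this (the ids and year/month values are illustrative):
SELECT timestamp, message FROM inbox
WHERE user_id = 1 AND contact_id = 2
ORDER BY timestamp DESC;
SELECT timestamp, message FROM global_inbox
WHERE user_id = 1 AND year = 2018 AND month = 3
ORDER BY timestamp DESC;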
In terms of query performance, yes, it sounds good to me. Apache Cassandra is really built for this kind of data modeling: we build tables to satisfy queries. This is the process called 'denormalization' in the Cassandra paradigm, and it will increase query performance. You have duplicated data, but the main goal is to have fast queries.

How to design cassandra schema so additional columns can be easily added later?

I have defined the table structure below:
CREATE TABLE sensor_data (
asset_id text,
event_time timestamp,
sensor_type int,
temperature int,
humidity int,
voltage int,
co2_percent int,
PRIMARY KEY(asset_id, event_time)
) WITH CLUSTERING ORDER BY (event_time ASC)
This table captures data coming from a sensor, and depending on the type of sensor (the sensor_type column), some columns will have a value and others will not. For example, temperature only applies to a temperature sensor, humidity only applies to a humidity sensor, etc.
Now, as I work with more and more sensors, my intention is to simply add additional columns using the ALTER TABLE command. Is this a correct strategy to follow, or are there better ways to design this table for future use?
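That is, when a new sensor type appears I would run something like this (the new column is just an example):
ALTER TABLE sensor_data ADD pressure int;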
I answered a similar question a few hours ago: here
Assuming you're on Cassandra 2.x, your situation is easier to handle; to do what you need I'd use a map:
CREATE TABLE sensor_data (
asset_id text,
event_time timestamp,
sensor_type int,
sensor_info map<text, int>,
PRIMARY KEY(asset_id ,event_time)
) WITH CLUSTERING ORDER BY (event_time ASC)
The advantage is that your schema will remain the same even as new sensors come into your world. The disadvantage is that you won't be able to retrieve a specific entry from your collection; you will always retrieve the collection in its entirety. If you're on Cassandra 2.1, secondary indexes on collections might help.
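For example, writes would look something like this (values are illustrative); new sensor readings simply become new map keys, with no schema change:
INSERT INTO sensor_data (asset_id, event_time, sensor_type, sensor_info)
VALUES ('asset-42', '2017-06-01 10:00:00', 1, {'temperature': 23, 'humidity': 40});
UPDATE sensor_data
SET sensor_info = sensor_info + {'co2_percent': 5}
WHERE asset_id = 'asset-42' AND event_time = '2017-06-01 10:00:00';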
HTH,
Carlo
