We are pulling data from around 20-25 industrial motor sensors, and the data is being stored in a Cassandra database. Cassandra is currently running on a single node.
Below is the table structure
CREATE TABLE cisonpremdemo.machine_data (
id uuid PRIMARY KEY,
data_temperature bigint,
data_current bigint,
data_timestamp timestamp,
deviceid text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND default_time_to_live = 7884000
AND gc_grace_seconds = 100;
CREATE INDEX deviceid_idx ON cisonpremdemo.machine_data (deviceid);
CREATE INDEX data_timestamp_idx ON cisonpremdemo.machine_data (data_timestamp);
Data has been collected in this table for a couple of months, with a reading roughly every 5 seconds, almost 24 hours a day, so the volume of data is quite large.
I am trying to execute a date-range query using Java and .NET, and in both cases I am getting timeout errors (Cassandra failure during read query at consistency LocalOne (0 replica(s) responded over 1 required)).
The query works fine if I give a LIMIT of 100; anything above that fails. Some of the things I have tried:
1) Increased the query timeout.
2) Reduced gc_grace_seconds to 100 (temporarily) to eliminate any tombstones.
Query used
SELECT data_temperature AS "DATA_TEMP",data_current AS "DATA_CURRENT" FROM machine_data
WHERE DATA_TIMESTAMP>=1517402474699
AND DATA_TIMESTAMP<=1517402774699
AND DEVICEID='BP_100' ALLOW FILTERING;
I am not sure whether the table structure (primary key) is the wrong choice. Should the primary key be both deviceid and timestamp?
The secondary indexes will almost surely fail. They should have "not too low, not too high" cardinality (what that means depends on the number of nodes in the ring). It's very hard to get right, and you should avoid secondary indexes unless you have a strong need and the data fits (e.g. when cross-table consistency with a denormalized table is not possible).
Another thing you should never use is ALLOW FILTERING. It exists pretty much just for debugging/development and for large Spark-style jobs that read the entire dataset. It is horribly expensive and will almost always result in timeouts in the long term.
Instead, you should create new tables, and also break them up by time so the partitions do not get too large, e.g.:
CREATE TABLE cisonpremdemo.machine_data_by_time (
id uuid,
data_temperature bigint,
data_current bigint,
data_timestamp timestamp,
yymm text,
deviceid text,
PRIMARY KEY ((deviceid, yymm), data_timestamp)
) WITH CLUSTERING ORDER BY (data_timestamp DESC);
When you insert your data, write to both tables. Essentially, create one table for each kind of request you have, so the data is stored in the format you need it. Do not model your tables around how the data looks. If you do not need direct lookups by uuid, do not create the machine_data table as shown above at all, since that is not how you are querying it.
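As an illustrative sketch (not from the original answer), the yymm bucket for such a table could be derived from the record's epoch-millisecond timestamp, assuming UTC monthly buckets:

```python
from datetime import datetime, timezone

def yymm_bucket(ts_millis: int) -> str:
    """Derive the 'yymm' partition bucket from an epoch-millisecond timestamp (UTC)."""
    dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    return dt.strftime("%y%m")

# Timestamp from the failing query in the question (2018-01-31 UTC):
print(yymm_bucket(1517402474699))  # -> 1801
```

The application writes this bucket alongside each insert and supplies it (together with deviceid) when querying, so every read hits exactly one known partition.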
I have basically a data table like this (a partition key id, along with a serialized value serialized_value):
CREATE TABLE keyspace.data (
id bigint,
serialized_value blob,
PRIMARY KEY (id)
) WITH caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy', 'enabled': 'true'}
AND compression = { 'class' : 'LZ4Compressor'};
The use case involves maintaining multiple versions of the data (the serialized_value for a given id).
Every day, I have to send a fresh version of the data into Cassandra. This involves 100 million rows/partitions each time.
Of course, I don't need to maintain ALL versions of the data, only the last 4 days (i.e. the four most recent version_ids).
I have identified three solutions:
Solution 1: TTL
The idea is to set a TTL at insert time. That way, the oldest versions of the data are automatically dropped, without any problems related to tombstones.
Pros:
no read performance penalty (?)
no problems related to tombstones
Cons:
if ingestion fails for several days, I may lose all the data in the Cassandra cluster because of the TTL-based automatic deletes
Solution 2: dynamic tables
the table creation becomes :
CREATE TABLE keyspace.data_{version_id} (
id bigint,
serialized_value blob,
PRIMARY KEY (id)
) ...;
The table name includes the version_id.
Pros:
the table (corresponding to a version) is easy to delete
no read performance penalty
no problems related to tombstones
Cons:
dynamically adding a table to the cluster may require all the nodes to be up every time.
a bit more difficult to handle client-side (queries go to a version-specific table name instead of a constant one)
Solution 3: large numbers of deletes
In this case, all the data stays in a single table, and a version_id is added to the partition key.
CREATE TABLE keyspace.data (
version_id int,
id bigint,
serialized_value blob,
PRIMARY KEY ((version_id,id))
) ...;
Pros:
only a single table to create and maintain for the entire application lifecycle
Cons:
a read performance penalty may occur because of the large number of tombstones
problems related to tombstones, because a large amount of data needs to be deleted in order to purge everything belonging to an old version_id.
The delete will only match the exact partition key, so it will generate partition tombstones and NOT cell tombstones; even so, I'm worried about the performance of doing that.
What is the best way to achieve this? :-)
It would be preferable to cluster your data by a date or timestamp sorted in reverse order, still with a TTL set. For example:
CREATE TABLE ks.blobs_by_id (
id bigint,
version timestamp,
serialized_value blob,
PRIMARY KEY (id, version)
) WITH CLUSTERING ORDER BY (version DESC);
If you have a default TTL on the table, older versions will automatically expire, so when you retrieve the rows with:
SELECT ... FROM blobs_by_id WHERE id = ? LIMIT 4
Only the 4 most recent rows will be returned (in descending order), and you won't be iterating over the deleted rows. Cheers!
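For concreteness, here is a 4-day retention window expressed as a TTL, with a hypothetical parameterized INSERT following the blobs_by_id example above (the statement text is an illustration, not from the answer):

```python
# Retention window of 4 daily versions expressed as a TTL in seconds.
RETENTION_DAYS = 4
ttl_seconds = RETENTION_DAYS * 24 * 60 * 60

# Hypothetical prepared statement a driver would use with this TTL:
insert_cql = (
    "INSERT INTO ks.blobs_by_id (id, version, serialized_value) "
    f"VALUES (?, ?, ?) USING TTL {ttl_seconds}"
)
print(ttl_seconds)  # 345600
```

A per-statement TTL like this (or a default_time_to_live on the table) keeps the "last N days" invariant without issuing any explicit deletes.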
We have a Cassandra cluster that consists of six nodes with 4 CPUs and 16 GB RAM each, on shared underlying storage (SSD). I'm aware that shared storage is considered bad practice for Cassandra, but ours is limited to 3 Gb/s on reads and seems to be reliable under demanding disk loads.
Cassandra is used as an operational database for continuous stream processing.
Initially Cassandra serves requests at ~1,700 rps and the latencies look good (the initial proxyhistograms screenshot is omitted here).
But after a few minutes the performance starts to decrease, becoming more than three times worse over the next two hours. At the same time we observe that the IOWait time increases, and proxyhistograms degrade accordingly (screenshots omitted).
We can't understand the reasons that lie behind such behaviour. Any assistance is appreciated.
EDITED:
Table definitions:
CREATE TABLE IF NOT EXISTS subject.record(
subject_id UUID,
package_id text,
type text,
status text,
ch text,
creation_ts timestamp,
PRIMARY KEY((subject_id, status), creation_ts)
) WITH CLUSTERING ORDER BY (creation_ts DESC);
CREATE TABLE IF NOT EXISTS subject.c_record(
c_id UUID,
s_id UUID,
creation_ts timestamp,
ch text,
PRIMARY KEY(c_id, creation_ts, s_id)
) WITH CLUSTERING ORDER BY (creation_ts DESC);
CREATE TABLE IF NOT EXISTS subject.s_by_a(
s int,
number text,
hold_number int,
hold_type text,
s_id UUID,
PRIMARY KEY(
(s, number),
hold_type,
hold_number,
s_id
)
);
While opinions may vary on this, keeping your partitions in the 1 MB to 2 MB range is optimal. Cassandra typically doesn't perform well when returning large result sets, so keeping partition sizes small helps queries perform better.
Without knowing what queries are being run, I can say that with queries which deteriorate over time... time is usually the problem. Take this PRIMARY KEY definition, for example:
PRIMARY KEY((subject_id, status), creation_ts)
This tells Cassandra to store the data in a partition (hashed from the concatenation of subject_id and status), and then to sort and enforce uniqueness by creation_ts. The problem here is that there is no inherent way to limit the size of the partition: since the clustering key is a timestamp, each new entry (to a particular partition) makes it larger and larger over time.
Also, status is by definition temporary and subject to change. For that to happen, partitions would have to be deleted and recreated with every status update. When modeling systems like this, I usually recommend making status a non-key column with a secondary index. While secondary indexes in Cassandra aren't a great solution either, they can work if the result set isn't too large.
With cases like this, taking a "bucketing" approach can help. Essentially, pick a time component to partition by, thus ensuring that partitions cannot grow infinitely.
PRIMARY KEY((subject_id, month_bucket), creation_ts)
In this case, the application writes a timestamp (creation_ts) and the current month (month_bucket). This helps ensure that you're never putting more than a single month's worth of data in a single partition.
Now this is just an example. A whole month might be too much, in your case. It may need to be smaller, depending on your requirements. It's not uncommon for time-driven data to be partitioned by week, day, or even hour, depending on the required granularity.
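One rough way to pick the bucket granularity (the write rate below is a made-up assumption, not taken from the question) is to estimate how many rows accumulate in one partition per bucket interval:

```python
def rows_per_bucket(writes_per_second: float, bucket_seconds: int) -> int:
    """Estimated rows accumulated in one partition over one bucket interval."""
    return int(writes_per_second * bucket_seconds)

rate = 2.0  # hypothetical writes/second hitting a single partition key
print(rows_per_bucket(rate, 30 * 86400))  # month bucket: 5184000 rows
print(rows_per_bucket(rate, 7 * 86400))   # week bucket:  1209600 rows
print(rows_per_bucket(rate, 86400))       # day bucket:    172800 rows
```

Multiply the row count by the average row size and compare against the partition-size target discussed above to decide between month, week, day, or hour buckets.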
I am fairly new to Cassandra and currently have the following table:
CREATE TABLE time_data (
id int,
secondary_id int,
timestamp timestamp,
value bigint,
PRIMARY KEY ((id, secondary_id), timestamp)
);
The compound partition key (with secondary_id) is necessary in order to not violate max partition sizes.
The issue I am running into is that I would like to run the query SELECT * FROM time_data WHERE id = ?. Because the table has a compound partition key, this query requires filtering. I realize this queries a lot of data and partitions, but it is necessary for the application. For reference, id has relatively low cardinality and secondary_id has high cardinality.
What is the best way around this? Should I simply allow filtering on the query? Or is it better to create a secondary index like CREATE INDEX id_idx ON time_data (id)?
You will need to specify the full partition key in your queries (ALLOW FILTERING will impact performance badly in most cases).
One option, if you know all secondary_ids (you could add a table to track them if necessary), is to do the work in your application: query all (id, secondary_id) pairs and process the results afterwards. This has the disadvantage of being more complex, but the advantage that it can be done with async queries in parallel, so many nodes in your cluster participate in processing your task.
See also https://www.datastax.com/dev/blog/java-driver-async-queries
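A minimal sketch of that fan-out, with the actual driver call stubbed out (a real version would use the async API of your driver, e.g. execute_async in the DataStax drivers):

```python
from concurrent.futures import ThreadPoolExecutor

def query_partition(id_, secondary_id):
    # Stand-in for a real per-partition query such as:
    #   SELECT * FROM time_data WHERE id = ? AND secondary_id = ?
    return [(id_, secondary_id)]

def fetch_all(id_, secondary_ids, max_workers=8):
    """Issue one query per (id, secondary_id) partition in parallel and merge results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(query_partition, id_, s) for s in secondary_ids]
        rows = []
        for f in futures:  # iterate in submit order to keep results deterministic
            rows.extend(f.result())
    return rows

print(fetch_all(42, [1, 2, 3]))  # [(42, 1), (42, 2), (42, 3)]
```

Each sub-query hits exactly one partition, so the coordinator load is spread across the cluster instead of one node scanning everything.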
Is it ever okay to build a data model that makes the fetch query easier even though it will likely create hotspots within the cluster?
While reading, please keep in mind that I am not working with Solr right now, and given the frequency at which this data will be accessed, I didn't think spark-sql would be appropriate. I would like to keep this pure Cassandra.
We have transactions, which are modeled using a UUID as the partition key so that the data is evenly distributed around the cluster. One of our access patterns requires that a UI get all records for a given user and date range, query like so:
select * from transactions_by_user_and_day where user_id = ? and created_date_time > ?;
The first model I built uses user_id and created_date (the day the transaction was created, always set to midnight) as the primary key:
CREATE TABLE transactions_by_user_and_day (
user_id int,
created_date timestamp,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id, created_date), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
This table seems to perform well. Using created_date as part of the partition key spreads users around the cluster more evenly and prevents hotspots. However, from an access perspective it makes the data access layer do a bit more work than we would like: it ends up having to build an IN clause with all days in the provided range instead of passing a date with a greater-than operator:
select * from transactions_by_user_and_day where user_id = ? and created_date in (?, ?, …) and created_date_time > ?;
To simplify the work to be done at the data access layer, I have considered modeling the data like so:
CREATE TABLE transactions_by_user_and_day (
user_id int,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
With the above model, the data access layer can fetch the transaction_ids for the user and filter on a specific date range within Cassandra. However, this introduces a chance of hotspots within the cluster: users with longevity and/or high volume will create quite a few more columns in their row. We intend to set a TTL on the data so anything older than 60 days drops off.
Additionally, I've analyzed the size of the data, and 60 days' worth of data for our highest-volume user is under 2 MB. Doing the math: if we assume all 40,000 users (this number won't grow significantly) are spread evenly over a 3-node cluster at 2 MB of data per user, you end up with a maximum of just over 26 GB per node ((13,333.33 * 2) / 1024). In reality, you aren't going to end up with 1/3 of your users doing that much volume, and you'd have to get really unlucky for Cassandra, using vnodes, to put all of those users on a single node. From a resources perspective, I don't think 26 GB is going to make or break anything either.
Thanks for your thoughts.
Data Model 1: Something else you could do would be to change your data access layer to issue a separate query for each partition (day) individually, instead of using the IN clause. Check out this page to understand why that is better.
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
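A sketch of what that looks like: enumerate the midnight dates in the range and issue one query per (user_id, created_date) partition instead of a single multi-partition IN query (the CQL in the comment mirrors the question's table):

```python
from datetime import date, timedelta

def day_partitions(start: date, end: date):
    """Yield one partition key (midnight date) per day in [start, end]."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

# One async query per day, e.g.:
#   SELECT * FROM transactions_by_user_and_day
#   WHERE user_id = ? AND created_date = ? AND created_date_time > ?
days = list(day_partitions(date(2018, 1, 29), date(2018, 1, 31)))
print(len(days))  # 3
```

Each per-day query goes to the replicas owning that one partition, so no single coordinator has to gather results from many partitions at once.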
Data Model 2: 26 GB of data per node doesn't seem like much, but a 2 MB fetch seems a bit large. Of course, if that is an outlier, then I don't see a problem with it. You might try setting up a cassandra-stress job to test the model. As long as the majority of your partitions are smaller than 2 MB, that should be fine.
One other option would be to use Data Model 2 with bucketing. That would add overhead on writes, though, as you'd also have to maintain a bucket lookup table. Let me know if you need me to elaborate on this approach.
Say I want to collect logs from a huge application cluster which produces 1,000-5,000 records per second. In the future this number might reach 100,000 records per second, aggregated from a 10,000-node datacenter.
CREATE TABLE operation_log (
-- Seconds will be used as row keys, thus each row will
-- contain 1000-5000 log messages.
time_s bigint,
time_ms int, -- Microseconds (to sort data within one row).
uuid uuid, -- Monotonous UUID (NOT time-based UUID1)
host text,
username text,
accountno bigint,
remoteaddr inet,
op_type text,
-- For future filters — renaming a column must be faster
-- than adding a column?
reserved1 text,
reserved2 text,
reserved3 text,
reserved4 text,
reserved5 text,
-- 16*n bytes of UUIDs of connected messages, usually 0,
-- sometimes up to 100.
submessages blob,
request text,
PRIMARY KEY ((time_s), time_ms, uuid)) -- Partition on time_s
-- Because queries will be "from current time into the past"
WITH CLUSTERING ORDER BY (time_ms DESC);
CREATE INDEX oplog_remoteaddr ON operation_log (remoteaddr);
...
(secondary indices on host, username, accountno, op_type);
...
CREATE TABLE uuid_lookup (
uuid uuid,
time_s bigint,
time_ms int,
PRIMARY KEY (uuid));
I want to use the OrderedPartitioner, which will spread data over the cluster by time_s (seconds). It must also scale to dozens of concurrent data writers as more application log aggregators are added to the application cluster (uniqueness and consistency are guaranteed by the uuid part of the PK).
Analysts will have to look at this data by performing these sorts of queries:
range query over time_s, filtering on any of the data fields (SELECT * FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND $filters),
pagination query from the results of the previous one (SELECT * FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND token(uuid) < token($uuid) AND $filters),
count messages filtered by any data fields within a time range (SELECT COUNT(*) FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND $filters),
group all data by any of the data fields within some range (will be performed by application code),
request dozens or hundreds of log messages by their uuid (hundreds of SELECT * FROM uuid_lookup WHERE uuid IN [00000005-3ecd-0c92-fae3-1f48, ...]).
My questions are:
Is this a sane data model?
Is using OrderedPartitioner the way to go here?
Does provisioning a few columns for potential filters make sense? Or is adding a column every once in a while cheap enough to run on a Cassandra cluster with some reserved headroom?
Is there anything that prevents it from scaling to 100000 inserted rows per second from hundreds of aggregators and storing a petabyte or two of queryable data, provided that the number of concurrent queryists will never exceed 10?
This data model is close to a sane model, with several important modifications/caveats:
Do not use ByteOrderedPartitioner, especially not with time as the key. Doing this will result in severe hotspots on your cluster, as you'll do most of your reads and all your writes to only part of the data range (and therefore a small subset of your cluster). Use Murmur3Partitioner.
To enable your range queries, you'll need a sentinel key--a key you can know in advance. For log data, this is probably a time bucket + some other known value that's not time-based (so your writes are evenly distributed).
Your indices might be ok, but it's hard to tell without knowing your data. Make sure your values are low in cardinality, or the index won't scale well.
Make sure any potential filter columns adhere to the low-cardinality rule. Better yet, if you don't need real-time queries, use Spark to do your analysis. You can create new columns as needed; this is not a big deal, since Cassandra stores them sparsely. Alternatively, if you use Spark, you can store these values in a map.
If you follow these guidelines, you can scale as big as you want. If not, you will have very poor performance and will likely get performance equivalent to a single node.
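The sentinel-key advice above can be sketched as a composite partition key of a time bucket plus a small shard number, so writes within one bucket spread over several partitions (SHARDS, the hash choice, and the bucket size here are assumptions, not from the answer):

```python
import hashlib

SHARDS = 8  # hypothetical number of write shards per time bucket

def partition_key(time_s: int, uuid_str: str, bucket_seconds: int = 3600) -> tuple:
    """Compose (time_bucket, shard): the bucket is knowable in advance for
    range queries, while the shard spreads writes across SHARDS partitions."""
    bucket = time_s - (time_s % bucket_seconds)  # floor to the bucket boundary
    shard = int(hashlib.md5(uuid_str.encode()).hexdigest(), 16) % SHARDS
    return bucket, shard

bucket, shard = partition_key(1517402474, "some-log-uuid")
print(bucket)  # 1517400000
```

To read a time range, the application enumerates every bucket in the range and queries all (bucket, shard) pairs, which keeps reads deterministic while avoiding the single-partition write hotspot.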