Why shouldn't collections be used for unbounded data? - Cassandra

From Cassandra docs:
A collection is appropriate if the data for collection storage is limited. If the data has unbounded growth potential, like messages sent or sensor events registered every second, do not use collections.
Instead, use a table with a compound primary key where data is stored in the clustering columns.
I'm trying to understand why this is the case.
Let's say I have a messaging app, and instead of using PRIMARY KEY (chatId, timestamp, messageId) I'd use something like PRIMARY KEY (chatId) with a messages column, where messages is a list of all the messages in a chat.

Understand what? You want to store the entire chat history, i.e. all the messages, in a single column of a single row? Would you do that in a regular SQL database? No - there would be a table where each message is its own row.
Apart from the fact that you lose all ability to query individual messages in the proposed schema, the size of that one partition would balloon to the point where the operations required to manage the cluster become a nightmare. Remember that a non-frozen collection is read back in its entirety, so every read of that row would pull the whole chat history.
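To make the contrast concrete, here is a minimal sketch of the two schemas (column names are illustrative, not from the original question):
-- Collection-based schema: the whole history lives in one cell of one row
CREATE TABLE chats_with_collection (
    chatid uuid,
    messages list<text>,    -- unbounded growth, always read back in full
    PRIMARY KEY (chatid)
);
-- Recommended: one row per message, ordered by the clustering columns
CREATE TABLE messages_by_chat (
    chatid uuid,
    sent_at timestamp,
    messageid timeuuid,
    body text,
    PRIMARY KEY (chatid, sent_at, messageid)
);
-- The clustered design supports range queries the collection cannot:
SELECT body FROM messages_by_chat WHERE chatid = ? AND sent_at > ? LIMIT 50;
Note that even the clustered design leaves the partition itself unbounded per chat; a production schema would typically add a time bucket (e.g. a day) to the partition key to cap partition size.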

Related

Is it possible to insert or update without supplying all primary keys in a Cassandra database?

I have an application which uses Cassandra as its database, and a row of the table is filled at three separate moments (by three inputs). I have four primary key columns in that table, and not all of them are available at the moment an insert or update happens.
The error is:
"Some partition key parts are missing" when trying to insert or update.
Please consider that my application has to make a lot of writes (near 300,000) to the database in a short interval, so I want to get the maximum write throughput from the database.
Maybe one approach could solve the issue: first read from the db, then write into the db, and use dummy values for primary key columns that are not available at the moment of inserting or updating. But that would add another 300,000 reads against the db and slow down the entire process, for both the database and my application.
So I am looking for another solution.
four primary key columns in that table, and not all of them are available at the moment an insert or update happens.
As you are finding out, that is not possible. Partition keys in particular are used (hashed) to determine which node in the cluster is primarily responsible for the data. As that is a fundamental part of the Cassandra write path, the full key must be known at write time and cannot be changed/updated later.
The same is true with clustering keys (keys which determine the on-disk sort order of all data within a partition). Omitting one or more will yield this message:
Some clustering keys are missing
Unfortunately, there isn't a good way around this. A row's keys must all be known before writing. It's worth mentioning that primary keys in Cassandra are unique, so any attempt to "update" one will simply result in a new row.
Perhaps writing the data to a streaming topic or message broker (like Pulsar or Kafka) beforehand would be a better option? Then, once all keys are known, the data (message) can be consumed from the topic and written to Cassandra.
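To illustrate why the write is rejected, a minimal sketch with a hypothetical table (names and types are mine, not from the question):
CREATE TABLE readings_by_sensor (
    device_id text,
    sensor_id text,
    reading_time timestamp,
    value double,
    PRIMARY KEY ((device_id, sensor_id), reading_time)
);
-- Rejected: partition key part sensor_id is missing
INSERT INTO readings_by_sensor (device_id, reading_time, value)
VALUES ('d-100', '2020-06-01 10:00:00', 21.5);
-- Accepted: every key column is supplied
INSERT INTO readings_by_sensor (device_id, sensor_id, reading_time, value)
VALUES ('d-100', 's-7', '2020-06-01 10:00:00', 21.5);
Running the second INSERT again with a different value upserts the same row; changing any key column creates a new row instead.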

Are client-side joins permissible in Cassandra if the client drills down on a datapoint?

I have this structure with about 1000 data points in a list on the website:
Datapoint1:
Datapoint2:
...
Datapoint1000:
With each datapoint containing 6 fields of information.
Each datapoint can be opened to reveal an additional 2-3x of information in a sublist.
Would making a new request upon the user clicking on one of my datapoints be considered bad practice in Cassandra? Should I just go ahead and get it all in one go?
Should I just go ahead and get it all in one go?
Definitely not.
Would making a new request upon the user clicking on one of my datapoints be considered bad practice in Cassandra?
That's absolutely the way you should do it. Cassandra is great at writing large amounts of data, but not so great at returning large amounts of data. Multiple small, key-based queries are definitely the way to go.
It is possible to do the JOINs on the client side, but as a general proposition, queries which require joins indicate that you possibly didn't design the data model correctly.
You need to model your data such that each application query maps to a single table. If you need to do a client-side JOIN, then you need to query the database multiple times to get the data required by your app. It will work, but it's not efficient, so it affects the performance of both the app and the database.
To illustrate with an example, let's say your app needs to display a customer's list of orders. The table would need to be partitioned by customer, with the orders as multiple (clustered) rows:
CREATE TABLE orders_by_customerid (
    customerid text,
    orderid text,
    orderdate timestamp,
    ordertotal decimal,
    ...
    PRIMARY KEY (customerid, orderid)
)
You would retrieve the list of orders for a customer with:
SELECT ... FROM orders_by_customerid WHERE customerid = ?
By default, the driver or Stargate API your app is using would page the results so only the first 100 rows (for example) will be returned instead of retrieving thousands of rows in a single pass. Note that the page size is configurable. Cheers!
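Coming back to the original datapoint question, a hedged sketch of the same pattern (all table and column names are illustrative): one table serves the 1000-datapoint summary list, and a second serves the drill-down, queried only when the user clicks:
CREATE TABLE datapoints_by_page (
    page_id text,
    datapoint_id text,
    field1 text,
    field2 text,    -- ...plus the remaining summary fields
    PRIMARY KEY (page_id, datapoint_id)
);
CREATE TABLE details_by_datapoint (
    datapoint_id text,
    detail_id text,
    detail text,
    PRIMARY KEY (datapoint_id, detail_id)
);
-- One (paged) query for the summary list:
SELECT * FROM datapoints_by_page WHERE page_id = ?;
-- One small key-based query per click:
SELECT * FROM details_by_datapoint WHERE datapoint_id = ?;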

Selecting records in Cassandra based on a time range at frequent intervals

I have a table in Cassandra where I am storing events as they come in; different processing is done on the events at different stages. The events are entered into the table with the event occurrence time. I need to get all the events whose event time is less than a certain time and do some processing on them. As it is a range query, it will invariably use scatter-gather. Can someone suggest the best way to do this? This process is going to happen every 5 seconds, and frequent scatter-gather in Cassandra is not a good idea, as it is an overhead on Cassandra itself which will degrade my overall application performance.
The table is as below:
PAS_REQ_STAGE (partition key = (EndpointID, category); clustering key = (Automation_flag, AlertID))
AlertID
BatchPickTime: Timestamp
Automation_Threshold
ResourceID
ConditionID
category
Automation_time: Timestamp
Automation_flag
FilterValue
The event time I referred to above is the BatchPickTime.
A scheduler wakes up at a regular interval, gets all the records whose BatchPickTime is less than the current scheduler wake-up time, and sweeps them off the table to process them.
Because of this use case I cannot provide any specific partition key for the query, as it has to fetch all data which has expired, i.e. whose BatchPickTime is less than the current scheduler wake-up time.
Hi and welcome to Stack Overflow.
Please post your schema and maybe some example code with your question - you can edit it :)
The Cassandra way of doing this is to denormalize data if necessary and build your schema around your queries. In your case I would suggest putting your events into a table together with a time bucket:
CREATE TABLE events (
    event_source int,
    bucket timestamp,
    event_time timestamp,
    event_text text,
    PRIMARY KEY ((event_source, bucket), event_time)
);
The reason for this is that it is very efficient in Cassandra to select a row by its so-called partition key (in this example (event_source, bucket)), as such a query hits only one node. The remainder of the primary key is called the clustering columns and defines the order of data; here all events for a day inside the bucket are sorted by event_time.
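With that schema, the scheduler's sweep becomes a handful of single-partition queries instead of a cluster-wide scatter-gather; a sketch, assuming the scheduler knows its event sources and the current bucket:
-- For each known event_source and the current day's bucket:
SELECT event_time, event_text FROM events
WHERE event_source = ? AND bucket = ?
AND event_time < ?;    -- the scheduler wake-up time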
Try to model your event table in a way that you do not need to make multiple queries. There is a good and free data modeling course from DataStax available: https://academy.datastax.com/resources/ds220-data-modeling
One note - be careful when using Cassandra as a queue; this is widely considered an antipattern (deleting consumed events creates tombstones that slow down subsequent reads), and you might be better off with a message queue such as ActiveMQ or RabbitMQ or similar.

Audit Trail Design using Table Storage

I'm considering implementing an Audit Trail for my application using Azure Table Storage.
I need to be able to log all actions for a specific customer and all actions for entities from that customer.
My first guess was creating a table for each customer (Audits_CustomerXXX), using the entity id as the partition key and the (DateTime.MaxValue.Ticks - DateTime.Now.Ticks).ToString("D19") value as the row key. And this works great when my question is "what happened to a certain entity?" For instance, the audit of a purchase would have PartitionKey = "Purchases/12345" and the RowKey as the reversed timestamp.
But when I want a bird's-eye view of the entire customer, can I just query the table sorted by row key across partitions? Or is it better to create a secondary table to hold the data with different partition keys? Also, when using the (DateTime.MaxValue.Ticks - DateTime.Now.Ticks).ToString("D19") value, is there a way to prevent errors when two actions in the same partition happen in the same tick (unlikely, but who knows...)?
Thanks
You could certainly create a separate table for the bird's-eye view, but you really don't have to. Considering Azure Tables are schema-less, you can keep this data in the same table as well. You would keep the reversed ticks as the PartitionKey and the entity id as the RowKey. Because you would be querying only on PartitionKey, you can also keep the RowKey as a GUID; this will ensure that all entities are unique. Or you could append a GUID to your entity id and use that as the RowKey.
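On the same-tick concern, a minimal C# sketch (assuming the Azure.Data.Tables SDK; entity names and values are illustrative) of a key scheme where a GUID suffix makes collisions impossible even within a single tick:
using System;
using Azure.Data.Tables;

// Reverse the ticks so the newest audit entries sort first.
string reversedTicks = (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("D19");

// The GUID suffix keeps RowKeys unique even when two actions share a tick.
string rowKey = $"Purchases/12345_{Guid.NewGuid():N}";

var entity = new TableEntity(reversedTicks, rowKey)
{
    { "Action", "PurchaseCreated" }
};
// The entity would then be written with TableClient.AddEntity(entity).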
However, do keep in mind that because you're inserting two entities with different PartitionKey values, you will have to safeguard your code against possible network failures, as each entry will be a separate request to the Table service. The way we're handling this in our application is to write the payload to a queue message and then process that message through a background process.

Azure storage table records filtering advice

Imagine something like a blog posting system, built using Azure Table Storage.
A user posts a message and the database records user's Region, City and Language along with it.
After that, a user is able to browse all other users' posts and filter them by any combination of Region, City and Language, or by none of them and see all posts.
I see several solutions:
1. Put each message in 8 different partitions with combinations of Region-City-Language (pros: lightning-fast point queries on read; cons: 8 transactions per message on write).
2. Put each message in 4 different partitions with combinations of Region-City and the ability to do a partition scan to filter by language (pros: fewer transactions than (1); cons: partition scan, 4 transactions per message).
3. Put each message in partitions based on the user's ID (pros: a single transaction per message; cons: slow table scan and partition scan after that).
The way I see it:
1. Fast reads, slow (and perhaps costly) writes.
2. Balanced reads/writes/cost.
3. Fast writes, slow (but cheap) reads.
By "cost/cheap" I mean pricing based on transactions (not space), and by "balanced" I mean only relative to these variants.
I thought about using index tables, but I can't see how they would help here.
So the question is, perhaps there is another, better way?
I've decided to go with a variation of (1).
The difference is that I won't be storing ALL combinations of Region-Location-Language. Instead, I decided to store only unique ones:
Table: FiltersByRegion
----------------------
Partition: Region
Row: Location.Language
Prop: Message
Table: FiltersByRegionPlace
---------------------------
Partition: Region.Location
Row: Language
Prop: Message
Table: FiltersByRegionLanguage
------------------------------
Partition: Region.Language
Row: Location
Prop: Message
Table: FiltersByLanguage
------------------------
Partition: Language
Row: Region.Location
Prop: Message
Because I'm storing only unique combinations, there won't be many transactions per post - only those for combinations that are not already present in the database.
In other words, if there are a lot of posts from the same region-location-language, the filter tables won't be updated and transactions won't be spent. Tests for uniqueness could use Redis to speed things up a bit.
Filtering is now only a matter of picking the right table.
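For illustration, a small C# sketch (the values and the dedup strategy are mine, not from the post) of how one post fans out into at most four filter-table entries:
using System;

// A post tagged EU / Paris / en yields one candidate entry per filter table.
var (region, location, language) = ("EU", "Paris", "en");

var entries = new[]
{
    (Table: "FiltersByRegion",         Pk: region,                 Rk: $"{location}.{language}"),
    (Table: "FiltersByRegionPlace",    Pk: $"{region}.{location}", Rk: language),
    (Table: "FiltersByRegionLanguage", Pk: $"{region}.{language}", Rk: location),
    (Table: "FiltersByLanguage",       Pk: language,               Rk: $"{region}.{location}"),
};

foreach (var e in entries)
{
    // Spend a transaction only when the combination is new, e.g. after a
    // cheap membership test against Redis.
    Console.WriteLine($"{e.Table}: PK={e.Pk}, RK={e.Rk}");
}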
It depends on your scenarios and read/write pattern. You might want to consider a few aspects:
Design for how the records will be queried. Putting them into a "Region-City-Language" partition with the message ID as entity data may help make your queries fast.
Give each message a unique message ID and save the ID-to-message mappings in other tables; then every time a message is updated you only need to update one table, and the message ID referenced in the other tables stays unchanged.
Leverage PartitionKey and RowKey in your table design and query entities with both keys. For instance: "Region-City-Language" as the partition key and "User" as the row key.
Consider storing duplicate copies of entities for query scenarios. For example, if you have heavy user-based and language-based queries, you may consider having two tables keyed by "user" and "language" respectively.
Please also refer to https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/ for a full guide.
