Data modeling: Data without uniqueness - Cassandra

I have a use case where data needs to be dumped into a DB without any uniqueness. Think of random data, possibly with repeated values, generated at a very high rate.
Now, Cassandra requires every table to have a partition key.
I could introduce a TimeUUID column, but then the problem comes back at retrieval time; that in turn can only be handled with ALLOW FILTERING in the SELECT clause.
I am looking for a better approach. Can anyone suggest one? The only constraint is that I can only dump data into Cassandra; a file system is not available.

It seems like you just want to store your data without knowing yet how you will query it. With Cassandra, you typically need to know your queries before you design your data model. If you need to retrieve the full data set, you will get poor performance; you might want to consider HDFS instead.
If you really need to store it in Cassandra, try to think of a storage layout that makes sense. For example, you could store your data in time buckets. Try to size your bucket so it holds about 1 MB worth of data: if you produce 1 MB of data per minute, then a one-minute bucket is appropriate. You would have the minute as the partition key, a TimeUUID as a clustering column, and then the rest of your data.
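A minimal CQL sketch of that time-bucket layout (the table and column names here are invented for illustration):

-- One partition per minute; rows inside the partition are ordered
-- by a TimeUUID, which also gives each row a unique identity.
CREATE TABLE IF NOT EXISTS dumped_data (
    minute_bucket timestamp,  -- event time truncated to the minute
    event_id timeuuid,        -- unique, time-ordered row identifier
    payload text,             -- the actual (possibly duplicated) data
    PRIMARY KEY ((minute_bucket), event_id)
);

-- Retrieval then targets one bucket at a time, no ALLOW FILTERING needed:
SELECT payload FROM dumped_data WHERE minute_bucket = '2020-01-01 10:15:00+0000';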

Related

Secondary index on a low-cardinality clustering column

Using Cassandra as the db:
Say we have this schema:
PRIMARY KEY ((id1), id2, type) with an index on type, because we want to query by id1 and type.
Is a query like
SELECT * FROM my_table WHERE id1=xxx AND type='some type'
going to perform well?
I wonder if we have to create and manage another table for this situation?
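For reference, the schema being described would look something like this in CQL (the column types are assumed):

CREATE TABLE my_table (
    id1 text,
    id2 text,
    type text,
    PRIMARY KEY ((id1), id2, type)
);

CREATE INDEX my_table_type_idx ON my_table (type);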
The way you are planning to use the secondary index is ideal (which is rare). Here is why:
You specify the partition key (id1) in your query. This ensures that only the relevant partition (node) will be queried, instead of hitting all the nodes in the cluster (which would not be scalable).
You are (presumably) indexing an attribute of low cardinality (I imagine you have maybe a few hundred types?), which is the sweet spot for secondary indexes.
Overall, your data model should perform well and scale. Still, if you are looking for optimal performance, I would suggest an additional table keyed ((id1), type, id2).
Final note: if you have a limited number of types, you might consider using ((id1), type, id2) alone as a single table. When querying by id1-id2, just issue a few parallel queries against the possible values of type.
The final decision needs to take into account your target latency, disk usage (duplicating a table with a different primary key is sometimes too expensive), and the frequency of each of your queries.
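A sketch of that additional table and of the parallel-query variant (names carried over from the question; the literal values are invented):

-- Same data, keyed so that type comes before id2 in the clustering order;
-- the id1 + type query then needs no secondary index at all.
CREATE TABLE my_table_by_type (
    id1 text,
    type text,
    id2 text,
    PRIMARY KEY ((id1), type, id2)
);

SELECT * FROM my_table_by_type WHERE id1 = 'xxx' AND type = 'some type';

-- If this is the only table, a lookup by id1 + id2 becomes one query per
-- possible type value, issued in parallel:
SELECT * FROM my_table_by_type WHERE id1 = 'xxx' AND type = 'type_a' AND id2 = 'yyy';
SELECT * FROM my_table_by_type WHERE id1 = 'xxx' AND type = 'type_b' AND id2 = 'yyy';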

Partition key for Azure Cosmos DB collection

I am a bit new to Azure Cosmos DB and trying to understand the concepts.
I would like help deciding the best possible partition key for a DocumentDB collection. Please refer to the image below, which shows possible partitions using different partition keys.
As mentioned in the blog post here:
An ideal partition key is one that appears frequently as a filter in your queries and has sufficient cardinality to ensure your solution is scalable.
From the above line, I think that in my case UserId can be used as the partition key.
Can someone please suggest which key is the best possible candidate for the partition key?
From the 10 things to know about DocumentDB Partitioned Collections post and the official Microsoft documentation, you can find lots of very good advice about the choice of partition key, so I'm not going to repeat it here.
The selection of partitioning keys depends on the data stored in the database and the frequent query filtering criteria.
It is often advised to partition on something like UserId, which is good if you have many users. Suppose your business logic issues many queries for a given UserId and each needs to look up no more than a few hundred entries. In such cases the data can be quickly extracted from a single partition without the overhead of collating data across partitions.
However, if you have millions of records per user, then partitioning on UserId is perhaps the worst option, as extracting large volumes of data from a single partition will soon outweigh the benefit of avoiding cross-partition collation. In such cases you want to distribute user data as evenly as possible over all partitions, and you may need to find another column to be the partition key.
So, if the data volume is very large, I suggest that you run some simple tests based on your business logic and choose the best-performing partition key. After all, the partition key cannot be changed once it is set up.
Hope it helps you.
It depends, but here are a few things to consider:
The blog post you mentioned says:
Additionally, the storage size for documents belonging to the same partition key is limited to 10GB. An ideal partition key is one that appears frequently as a filter in your queries and has sufficient cardinality to ensure your solution is scalable.
Also, I really recommend checking this post and video: https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
The choice of the partition key is an important decision that you have to make at design time. You must pick a property name that has a wide range of values and has even access patterns.
So make sure to choose a partition key that has many values and meets those requirements.

What is better in Cassandra: many tables of the same structure, or one table with many rows?

Suppose that I have 1000 entity types with exactly the same structure. For example, all entities have three fields:
String id;
String name;
int amount;
Also, I expect that there will be a huge number of records for every entity type in the system.
So I have two variants right now:
1. For each entity type, create a separate table which looks like:
CREATE TABLE <SOME_ENTITY_NAME> (
    id text PRIMARY KEY,
    name text,
    amount int
)
2. Create only one table, but with a composite primary key:
CREATE TABLE ALL_ENTITIES_TABLE (
    entity_name text,
    id text,
    name text,
    amount int,
    PRIMARY KEY ((entity_name, id))
);
Of course, supporting only one table is simpler, but what about performance?
So, the question is: which variant is better in terms of performance, taking into account that each entity type will have millions (maybe billions) of records?
There is a limit on the number of tables that can be created in a Cassandra cluster. The usual recommendation is to keep this number lower than 200, with ~500 being a "hard stop".
The reason for this is that every table requires the allocation of additional memory and other resources to keep auxiliary data, like key/row caches, bloom filters, etc. Depending on the Cassandra version, every table may require 1-2 MB of memory.
So in your case, the 2nd design is better because you keep all data in a single table, and your partition key will allow the data to be spread evenly between the nodes of the cluster.
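To illustrate, with the ALL_ENTITIES_TABLE definition from the question, every lookup supplies the full composite partition key (the values here are made up):

-- Both partition key components are required in the WHERE clause,
-- which routes the query to exactly one partition:
SELECT name, amount FROM ALL_ENTITIES_TABLE
WHERE entity_name = 'invoice' AND id = '42';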
In my opinion, the first approach is incorrect in terms of maintainability. That many dynamically created tables would be tough to maintain. Also, if you choose your partitioning/clustering order properly (as per the needs of data retrieval), it should be easier and more efficient to query. And if you are using a 3.x version of Cassandra, secondary indexes can come in handy.
NOTE: Secondary indexes don't allow sorting.
Cassandra was designed around the fact that disk space is the cheapest resource among all. You must build your data model around the queries that you will be using the most regardless of whether this model would consume more disk space or not - as long as it serves the purpose of your queries in the most efficient way. I wouldn't be able to answer your question without taking a look at the queries you will be using. In general, you must feel free to create as many tables as needed as long as it serves the purpose of your queries. I would recommend having a look here.

Spark: Continuously reading data from Cassandra

I have gone through Reading from Cassandra using Spark Streaming and through tutorial-1 and tutorial-2 links.
Is it fair to say that Cassandra-Spark integration currently does not provide anything out of the box to continuously get the updates from Cassandra and stream them to other systems like HDFS?
By continuously, I mean getting only those rows in a table which have changed (inserted or updated) since the last fetch by Spark. If there are too many such rows, there should be an option to limit the number of rows, and the subsequent Spark fetch should begin from where the previous one left off. An at-least-once guarantee is OK, but exactly-once would be very welcome.
If it's not supported, one way to support it could be to have an auxiliary column updated_time in each Cassandra table that needs to be queried by Spark, and then use that column for the queries. Or an auxiliary table per table that contains the ID and timestamp of the rows being changed. Has anyone tried this before?
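For what it's worth, the auxiliary-table idea could look something like this in CQL (all names are invented; the day bucket is only there to keep the partition from growing without bound):

-- Hypothetical change-log table: the application writes a row here
-- alongside every insert/update to the main table.
CREATE TABLE my_table_changes (
    day text,             -- e.g. '2020-01-01', bounds partition size
    changed_at timeuuid,  -- orders the changes within the day
    row_id text,          -- primary key of the changed row in the main table
    PRIMARY KEY ((day), changed_at)
);

-- A reader resumes from the last TimeUUID it processed:
SELECT row_id FROM my_table_changes
WHERE day = '2020-01-01' AND changed_at > 8c7f2d00-2c73-11ea-978f-2e728ce88125
LIMIT 1000;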
I don't think Apache Cassandra has this functionality out of the box. Internally it does store all operations on data sequentially [for some period of time], but that log is per node and it gets compacted eventually (to save space). Frankly, Cassandra's promise (like that of most other DBs) is to provide the latest view of the data (which by itself can be quite tricky in a distributed environment), not the full history of how the data changed.
So if you still want to have such info in Cassandra (and process it in Spark), you'll have to do some additional work yourself: design dedicated table(s) (or add synthetic columns), take care of partitioning, save an offset to keep track of progress, etc.
Cassandra is fine for time-series data, but in your case I would consider just using a streaming solution (like Kafka) instead of reinventing one.
I agree with what Ralkie stated, but wanted to propose one more solution if you're tied to C* for this use case. This solution assumes you have full control over the schema and the ingest as well. It is not a streaming solution, though it could awkwardly be shoehorned into one.
Have you considered using a composite partition key composed of a time bucket along with murmur_hash_of_one_or_more_clustering_columns % some_int_designed_limit_row_width? In this way, you could set your time buckets to 1 minute, 5 minutes, 1 hour, etc., depending on how "real-time" you need to analyze/archive your data. The murmur hash based on one or more of the clustering columns is needed to help locate data in the C* cluster (and it is a terrible solution if you're often looking up specific clustering columns).
For example, take an IoT use case where sensors report in every minute and have some sensor reading that can be represented as an integer.
CREATE TABLE IF NOT EXISTS iottable (
    timebucket bigint,
    sensorbucket int,
    sensorid varchar,
    sensorvalue int,
    PRIMARY KEY ((timebucket, sensorbucket), sensorid)
) WITH caching = 'none'
AND compaction = { 'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowedCompaction' };
Note the use of TimeWindowedCompaction. I'm not sure what version of C* you're using, but with the 2.x series I'd stay away from DateTieredCompaction. I cannot speak to how well it performs in 3.x. At any rate, you should test and benchmark extensively before settling on your schema and compaction strategy.
Also note that this schema could result in hotspotting, as it is vulnerable to sensors that report more often than others. Again, not knowing the use case, it's hard to provide a perfect solution; this is just an example. If you don't care about ever reading C* for a specific sensor (or column), you don't have to use a clustering column at all, and you can simply use a timeUUID or something random for the murmur-hash bucketing.
Regardless of how you decide to partition the data, a schema like this would then allow you to use repartitionByCassandraReplica and joinWithCassandraTable to extract the data written during a given timebucket.
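For illustration, reading back one time bucket from the iottable schema above means querying every sensorbucket for that bucket; assuming some_int_designed_limit_row_width were 4:

-- One query per (timebucket, sensorbucket) partition; these can be issued
-- in parallel, or pushed down from Spark via joinWithCassandraTable:
SELECT sensorid, sensorvalue FROM iottable
WHERE timebucket = 26280000 AND sensorbucket IN (0, 1, 2, 3);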

nosql separate data by client

I have to develop a project using a NoSQL database, either Couchbase or Cassandra.
I would like to know if it is recommended to partition the data of each customer in a separate bucket.
In my case, there will never be requests across different clients.
The data can be completely separated.
For Couchbase, I saw that a memory capacity is reserved for each bucket.
Or does the separation have to be done at another level, such as a document, or a super column for Cassandra?
Thank you
Or does the separation have to be done at another level, such as a document, or a super column for Cassandra?
Tip #1, when working with Cassandra, completely erase the word "super column" from your vocabulary.
I would like to know if it is recommended to partition the data of each customer in a bucket?
That depends. It sounds like your queries would be mostly based on a customer id, so it makes sense to have it as a part of your partition key. However, if each customer partition has millions of rows and/or columns underneath it, that's going to get very big.
Tip #2, proper Cassandra modeling is done based on what your required queries look like. So without actually seeing the kinds of queries you need to serve, it's going to be difficult to be any more specific than that.
If you have customer data relating to accounts and addresses, etc, then building a customers table with a PRIMARY KEY of only customer_id might make sense. But if you find that you need to query your customers (for example) by email_address, then you'll want to create a customers_by_email table, duplicate your data into that, and create a PRIMARY KEY that supports that.
Additionally, if you find yourself storing data on customer activity, you may want to consider a customer_activity table with PRIMARY KEY ((customer_id, month), activity_time). That uses both customer_id and month as the partition key, storing the customer's activity clustered by activity_time. If we didn't use month as an additional partition key here, each customer_id partition would be continually written to until it became too ungainly to write to or query (unbound row growth).
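A rough sketch of those two tables in CQL (the non-key columns are invented for illustration):

-- Query-specific duplicate of the customer data, keyed by email:
CREATE TABLE customers_by_email (
    email_address text,
    customer_id text,
    name text,
    PRIMARY KEY ((email_address))
);

-- Activity bucketed by month so that no single partition grows forever:
CREATE TABLE customer_activity (
    customer_id text,
    month text,            -- e.g. '2020-01'
    activity_time timestamp,
    action text,
    PRIMARY KEY ((customer_id, month), activity_time)
) WITH CLUSTERING ORDER BY (activity_time DESC);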
Summary:
- If anyone tells you to use a super column in Cassandra, slap them.
- You need to know your queries before you design your tables.
- Yes, customer_id would be a good way to keep your data separate and ensure that each query is restricted to a single node.
- Build your partition keys to account for unbound row growth, to save you from writing too much data into the same partition.
