how many partition key for a Cassandra table? - cassandra

Partition key for a Cassandra table?
In the customer table, customerid is the partition key.
Suppose I have 1 million customers in a year, so I have 1 million partitions.
After 10 years I have 10 million customers or more, so I have 10 million partitions.
So my question is:
1) If I want to read the customers table (10 million partitions), does that affect read performance?
Note: a single partition may have 50 to 100 columns.

You have the right idea in that you'll want to use data modeling to create a multi-tenant environment. The caveat is that you're not going to want to do full table/multiple partition scans in Cassandra to retrieve that data. It's pretty well documented as to why, but anytime you have a highly distributed environment, you will want to minimize the amount of network hops, data shuffling, etc. Can't fight physics :)
Anyway, it sounds like this is a reporting type of use case. You're going to need something like Spark or some other kind of map-and-reduce to report efficiently across multiple partitions like this.

Related

Datamodel for Scylla/Cassandra for table partition key is not known beforehand -> static field?

I am using ScyllaDB, but I think this also applies to Cassandra since ScyllaDB is compatible with Cassandra.
I have the following table (I have about 5 tables of this kind):
create table batch_job_conversation (
conversation_id uuid,
primary key (conversation_id)
);
This is used by a batch job to make sure some fields are kept in sync. In the application, a lot of concurrent writes/reads can happen. Once in a while, I will correct the values with a batch job.
A lot of writes can happen to the same row, so it will overwrite the rows. A batch job currently picks up rows with this query:
select * from batch_job_conversation
Then the batch job reads the data at that point and makes sure things are in sync. I think this query is bad because it stresses all the partitions and the coordinator node, since it needs to visit ALL partitions.
My question is whether it is better for this kind of table to have a fixed field. Something like this:
create table batch_job_conversation (
always_zero int,
conversation_id uuid,
primary key ((always_zero), conversation_id)
);
And then the query would be this:
select * from batch_job_conversation where always_zero = 0
For each batch job I can use a different partition key. The number of rows in these tables will be roughly the same (a few thousand at most). The same row will probably be overwritten many times.
Is it better to have a fixed value? Is there another way to handle this? I don't have a logical partition key I can use.
The second model would create one LARGE partition, and you don't want that, trust me ;-)
(You would do a partition scan on top of a large partition, which is worse than the original full scan.)
(Another piece of advice: keep your partitions small and have a lot of them, then all your CPUs will be used roughly equally.)
The first approach is OK. It is called a FULL SCAN, BUT
you need to manage it properly.
There are several ways; we blogged about it in https://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/
Basically it boils down to divide and conquer.
Also note that Spark implements full scans too.
hth
L
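The divide-and-conquer idea can be sketched as follows: split the token ring into sub-ranges and issue one range query per sub-range, so that no single request has to touch every partition. This is a minimal sketch, assuming the Murmur3 partitioner (tokens from -2^63 to 2^63-1). The helper names and the choice of 8 sub-ranges are illustrative, and the code only builds CQL strings rather than running them against a cluster.

```python
# Assumed: Murmur3 partitioner token bounds.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ranges(n):
    """Split the full token ring into n contiguous (start, end) ranges."""
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (total * i) // n for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(table, pk, start, end, first):
    """Build one sub-range query; only the first range includes its start token."""
    op = ">=" if first else ">"
    return (f"SELECT * FROM {table} "
            f"WHERE token({pk}) {op} {start} AND token({pk}) <= {end}")


def full_scan_queries(table, pk, n):
    """Return n queries that together cover every partition exactly once."""
    return [range_query(table, pk, start, end, i == 0)
            for i, (start, end) in enumerate(split_token_ranges(n))]


# Hypothetical usage: 8 sub-ranges for the table from the question.
for query in full_scan_queries("batch_job_conversation", "conversation_id", 8):
    print(query)
```

Each query can then be run independently (and in parallel, with a bounded number in flight), which is what keeps the load spread across nodes instead of concentrated on one coordinator.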

Coordinator pressure using IN query on single partition key with 9000 records, 4MB per partition

I have 1000 partitions per table; cust_id is the partition key, and bucket_id and timestamp are the clustering keys.
Every hour, one bucket_id and timestamp entry is recorded per cust_id.
Each day 24 * 1 = 24 rows are recorded per partition.
Over one year that is approximately 9000 records per partition.
Partition size is approximately 4MB.
---> 20-node Cassandra cluster, single DC, RF=3
I want to select five random buckets for the last 90 days of data using an IN query.
select cust_id,bucket_id,timestamp from customer_data where
cust_id='tlCXP5oB0cE2ryjgvvCyC52thm9Q11KJsEWe' and
bucket_id IN (0,2,5,7,8)
and timestamp >='2020-03-01 00:00:00' and
timestamp <='2020-06-01 00:00:00';
Please confirm, does this approach cause any issues with coordinator pressure and query timeouts?
How much data can a coordinator bear and return data without any issue?
How (internally) does an IN query scan the records on Cassandra? Please provide any detailed explanation.
If I run same kind of query for 10 Mil customers, does this affect coordinator pressure? Does it increase the chances to get a read timeout error?
It could be hard to get a definitive yes/no answer to these questions, because there are some unknowns in them. For example: which version of Cassandra, how much memory is allocated to the instance, what disks are used for data, what compaction strategy is used for the table, what consistency level you use for reading the data, etc.
Overall, on recent versions of Cassandra and when using SSDs, I wouldn't expect problems with that until you have hundreds of items in the IN list, especially if you're using consistency level LOCAL_ONE and prepared queries. All drivers use a token-aware load balancing policy by default and will route the request to a node that holds the data, so it will be both coordinator and data node. Using other consistency levels would put more pressure on the coordinating node, but it should still work quite well. Problems with read timeouts could start if you use HDDs or size the cluster incorrectly overall.
Regarding the 10 million customers: in your query you're selecting by partition key, so the query is usually sent to a replica directly (if you use prepared statements). To avoid problems you shouldn't use IN on the partition key column (cust_id in your case). If you run queries for individual customers, the driver will spread the queries over the whole cluster and you'll avoid increased pressure on coordinator nodes.
But as usual, you need to test your table schema and cluster setup to prove this. I would recommend using NoSQLBench, a benchmark/load testing tool that was recently open sourced by DataStax. It was built for quick load testing of clusters and checking data models, and it incorporates a lot of knowledge in the area of performance testing.
Please try to ask one question per question.
Regarding how much a coordinator node can handle, Alex is correct in that there are several factors which contribute to that.
Size of the result set.
Heap/RAM available on the coordinator node.
Network consistency between nodes.
Storage config (spinning, SSD, NFS, etc).
Coordinator pressure will vary widely based on these parameters. My advice is to leave all timeout threshold settings at their defaults. They are there to protect your nodes from becoming overwhelmed. Timeouts are Cassandra's way of helping you figure out how much it can handle.
How (internally) does an IN query scan the records on Cassandra? Please provide any detailed explanation.
Based on your description, the primary key definition should look like this:
PRIMARY KEY ((cust_id),bucket_id,timestamp)
The data will be stored on disk by partition, and sorted by the clustering keys, similar to this (assuming ascending order on bucket_id and descending order on timestamp):
cust_id                                 bucket_id  timestamp
'tlCXP5oB0cE2ryjgvvCyC52thm9Q11KJsEWe'  0          2020-03-02 04:00:00
                                                   2020-03-01 22:00:00
                                        1          2020-03-27 16:00:00
                                        2          2020-04-22 05:00:00
                                                   2020-04-01 17:00:00
                                                   2020-03-05 22:00:00
                                        3          2020-04-27 19:00:00
                                        4          2020-03-27 17:00:00
                                        5          2020-04-12 08:00:00
                                                   2020-04-01 12:00:00
Cassandra reads through the SSTable files in that order. It's important to remember that Cassandra reads sequentially off disk. When queries force it to perform random reads, that's when things may start to get a little slow. The read path has structures like partition offsets and bloom filters which help it figure out which files (and where inside them) have the data. But within a partition, it will need to scan clustering keys and figure out what to skip and what to return.
Depending on how many updates these rows have taken, it's important to remember that the requested data may stretch across multiple files. Reading one file is faster than reading more than one.
At the very least, you're forcing it to stay on one node by specifying the partition key. But you'll have to test how much a coordinator can return before causing problems. In general, I wouldn't specify double digits of items in an IN clause.
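To make the "skipping around" concrete, here is a simplified in-memory model of how an IN on bucket_id plus a timestamp range turns into several separate slices of one sorted partition. It is only a sketch of the idea, not Cassandra's actual storage engine. The names `sort_key` and `in_query_slices` are made up for illustration, and the rows are the sample rows from the layout above.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

EPOCH = datetime(1970, 1, 1)


def sort_key(bucket_id, ts):
    # bucket_id ascending, timestamp descending (hence the negation)
    return (bucket_id, -(ts - EPOCH).total_seconds())


def in_query_slices(rows, buckets, ts_from, ts_to):
    """Return matching rows of one partition, one contiguous slice per IN value."""
    ordered = sorted(rows, key=lambda r: sort_key(*r))
    keys = [sort_key(*r) for r in ordered]
    result = []
    for bucket in sorted(buckets):
        lo = bisect_left(keys, sort_key(bucket, ts_to))     # newest row comes first
        hi = bisect_right(keys, sort_key(bucket, ts_from))  # oldest row comes last
        result.extend(ordered[lo:hi])
    return result


# Sample rows for a single cust_id partition: (bucket_id, timestamp)
ROWS = [
    (0, datetime(2020, 3, 2, 4)), (0, datetime(2020, 3, 1, 22)),
    (1, datetime(2020, 3, 27, 16)),
    (2, datetime(2020, 4, 22, 5)), (2, datetime(2020, 4, 1, 17)),
    (2, datetime(2020, 3, 5, 22)),
    (3, datetime(2020, 4, 27, 19)),
    (4, datetime(2020, 3, 27, 17)),
    (5, datetime(2020, 4, 12, 8)), (5, datetime(2020, 4, 1, 12)),
]

matches = in_query_slices(ROWS, [0, 2, 5, 7, 8],
                          datetime(2020, 3, 1), datetime(2020, 6, 1))
for bucket_id, ts in matches:
    print(bucket_id, ts)
```

Every value in the IN list becomes its own seek into the sorted data, which is why a long IN list behaves more like many small range reads than one sequential read.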
In terms of optimizing file access, Jon Haddad (now of Apple) has a great article on this: "Apache Cassandra Performance Tuning - Compression with Mixed Workloads". It focuses mainly on the table compression settings (namely chunk_length_in_kb) and has some great tips on how to improve data access performance. Specifically, the section "How Data is Read" is of particular interest:
We pull chunks out of SSTables, decompress them, and return them to the client....During the read path, the entire chunk must be read and decompressed. We’re not able to selectively read only the bytes we need. The impact of this is that if we are using 4K chunks, we can get away with only reading 4K off disk. If we use 256KB chunks, we have to read the entire 256K.
The point of this quote, as it relates to your question, is that by skipping around (using IN), the coordinator will likely read data that it won't be returning.
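A rough way to see the effect of chunk size is to count how many whole chunks have to be read to serve a few scattered rows. The sketch below is a back-of-the-envelope model only: the byte offsets and the 200-byte row size are made-up numbers, and `bytes_read` is a hypothetical helper, not a Cassandra API.

```python
def bytes_read(offsets, row_size, chunk_size):
    """Bytes that must be read when every touched chunk is read in full."""
    chunks = set()
    for offset in offsets:
        first = offset // chunk_size
        last = (offset + row_size - 1) // chunk_size
        chunks.update(range(first, last + 1))
    return len(chunks) * chunk_size


# Three scattered 200-byte rows (illustrative offsets within an SSTable)
offsets = [0, 100_000, 1_000_000]
for chunk_kb in (4, 256):
    total = bytes_read(offsets, 200, chunk_kb * 1024)
    print(f"{chunk_kb}KB chunks: {total} bytes read for {3 * 200} bytes returned")
```

With these assumed offsets, the 4KB setting reads three small chunks while the 256KB setting reads two large ones, so far more bytes come off disk than are returned to the client.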

Will Cassandra be useful for this scenario

I have around 10 Million names, combination of about 5 files each consisting of 2 million names and there are 100s of users. Each user comes with a file which has 1Million numbers.
I need to process these 1 million numbers against 2 million names and generate the values and show the values with names to the User.
Will cassandra be a good choice to make?
Currently I'm using SQL with RoR but it's quite slow in returning the values.
Cassandra is a NoSQL database, not an RDBMS.
In case you don't know, there are no joins in Cassandra.
So if your problem is slow data retrieval because of IO, then Cassandra is definitely a good choice.
However, if your results are slow because of joins, then Cassandra cannot help you.
Because, like I said, there are no joins in Cassandra.
Now, coming to your requirement.
More information is needed to form an opinion, such as:
When do you want to process the data to create the values (as a batch, or on the fly)?
How many data records do you want to pull and show to the user at a time, etc.?

What is the best data model for timeseries in Cassandra when *fast sequential reads* are required

I want to store streaming financial data into Cassandra and read it back fast. I will have up to 20000 instruments ("tickers") each containing up to 3 million 1-minute data points. I have to be able to read large ranges of each of these series as speedily as possible (indeed it is the reason I have moved to a columnar-type database as MongoDB was suffocating on this use case). Sometimes I'll have to read the whole series. Sometimes I'll need less but typically the most recent data first. I also want to keep things really simple.
Is this model, which I picked up in a Datastax tutorial, the most effective? Not everyone seems to agree.
CREATE TABLE minutedata (
ticker text,
time timestamp,
value float,
PRIMARY KEY (ticker, time))
WITH CLUSTERING ORDER BY (time DESC);
I like this because there are up to 20 000 tickers so the partitioning should be efficient, and there are only up to 3 million minutes in a row, and Cassandra can handle up to 2 billion. Also with the time descending order I get most recent data when using a limit on the query.
However, the book Cassandra High Availability by Robbie Strickland mentions the above as an anti-pattern (using sensor-data analogy), and I quote the problems he cites from page 144:
Data will be collected for a given sensor indefinitely, and in many cases at a very high frequency
With sensorID as the partition key, the row will grow by two columns for every reading (one marker and one reading).
I understand point one would be a problem but it's not in my case due to the 3 million data point limit. But point 2 is interesting. What are these "markers" between each reading? I clearly want to avoid anything that breaks contiguous data storage.
If point 2 is a problem, what is a better way to model timeseries so that they can efficiently be read in large ranges, fast? I'm not particularly keen to break the timeseries into smaller sub-periods.
If your query pattern was to find a few rows for a ticker using a range query, then I would say having all the data for a ticker in one partition would be a good approach since Cassandra is optimized to access partitions efficiently.
But if everything is in one partition, then the query is happening on only one node. Since you say you often want to read large ranges of rows, you may want more parallelism.
If you split that same data across many nodes and read it in parallel, you may be able to get better performance. For example, if you partitioned your data by ticker and by year, and you had ten nodes, you could theoretically issue ten async queries and have each year queried in parallel.
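As a minimal sketch of that idea, assuming a hypothetical table partitioned by (ticker, year): work out which year buckets a time range touches, fetch each bucket concurrently, and merge the results newest first. The `fetch_partition` callable is a stand-in for whatever driver call actually runs the per-partition query; it is not a real driver API.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime


def year_buckets(start, end):
    """Years (partition buckets) touched by the inclusive range [start, end]."""
    return list(range(start.year, end.year + 1))


def read_range(fetch_partition, ticker, start, end, max_workers=10):
    """Fetch every (ticker, year) partition in parallel and merge newest first.

    fetch_partition(ticker, year, start, end) is a hypothetical callable that
    returns a list of (time, value) rows for one partition.
    """
    years = year_buckets(start, end)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(
            lambda year: fetch_partition(ticker, year, start, end), years))
    rows = [row for part in parts for row in part]
    return sorted(rows, key=lambda row: row[0], reverse=True)


print(year_buckets(datetime(2018, 5, 1), datetime(2020, 2, 1)))
```

Whether this beats a single-partition read for 3 million rows depends on the cluster, so it is worth benchmarking both layouts.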
Now 3 million rows is a lot, but not really that big, so you'd probably have to run some tests to see which approach was actually faster for your situation.
If you're doing more than just retrieving all these rows and are running some kind of analytics on them, then parallelism becomes more attractive, and you might want to look into pairing Cassandra with Spark so that the data can be read and processed in parallel on many nodes.

Performance - Table Service, SQL Azure - insert. Query speed on large amount of data

I've read many posts and articles comparing SQL Azure and Table Service, and most of them say that Table Service is more scalable than SQL Azure.
http://www.silverlight-travel.com/blog/2010/03/31/azure-table-storage-sql-azure/
http://www.intertech.com/Blog/post/Windows-Azure-Table-Storage-vs-Windows-SQL-Azure.aspx
Microsoft Azure Storage vs. Azure SQL Database
https://social.msdn.microsoft.com/Forums/en-US/windowsazure/thread/2fd79cf3-ebbb-48a2-be66-542e21c2bb4d
https://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
https://stackoverflow.com/questions/2711868/azure-performance
http://vermorel.com/journal/2009/9/17/table-storage-or-the-100x-cost-factor.html
Azure Tables or SQL Azure?
http://www.brentozar.com/archive/2010/01/sql-azure-frequently-asked-questions/
https://code.google.com/p/lokad-cloud/wiki/FatEntities
Sorry for http, I'm new user >_<
But the http://azurescope.cloudapp.net/BenchmarkTestCases/ benchmark shows a different picture.
My case. Using SQL Azure: one table with many inserts, about 172,000,000 per day (2000 per second). Can I expect good performance for inserts and selects when I have 2 million records or 9999....9 billion records in one table?
Using Table Service: one table with some number of partitions. The number of partitions can be large, very large.
Question #1: Does Table Service have any limitations or best practices for creating many, many, many partitions in one table?
Question #2: In a single partition I have a large number of small entities, as in the SQL Azure example above. Can I expect good performance for inserts and selects when I have 2 million records or 9999 billion entities in one partition?
I know about sharding or partitioning solutions, but this is a cloud service; isn't the cloud powerful enough to do all the work without my coding skills?
Question #3: Can anybody show me benchmarks for querying large amounts of data for SQL Azure and Table Service?
Question #4: Maybe you could suggest a better solution for my case.
Short Answer
#1: I haven't seen lots of partitions cause Azure Tables (AZT) problems, but I don't have this volume of data.
#2: The more items in a partition, the slower queries in that partition.
#3: Sorry, no, I don't have the benchmarks.
#4: See below.
Long Answer
In your case I suspect that SQL Azure is not going to work for you, simply because of the limits on the size of a SQL Azure database. If each of the rows you're inserting is 1K with indexes, then at 2000 inserts per second you will hit the 50GB limit in about 0.3 days (roughly 7 hours). It's true that Microsoft is talking about databases bigger than 50GB, but they've given no time frames on that. SQL Azure also has a throughput limit that I'm unable to find at this point (I'm pretty sure it's less than what you need though). You might be able to get around this by partitioning your data across more than one SQL Azure database.
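A quick back-of-the-envelope check of how fast the size limit is reached, assuming 1 KB (1024 bytes) per row including indexes, the 2000 inserts per second from the question, and a 50GB cap. The function name is illustrative.

```python
def days_until_full(inserts_per_second, row_bytes, limit_bytes):
    """Days of continuous inserts until the database reaches its size limit."""
    bytes_per_day = inserts_per_second * 86_400 * row_bytes
    return limit_bytes / bytes_per_day


days = days_until_full(2_000, 1_024, 50 * 1024**3)
print(f"{days:.2f} days, i.e. about {days * 24:.1f} hours")
```

At this insert rate the database grows by well over 100GB per day, so a single 50GB database fills in a matter of hours rather than months.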
The advantage SQL Azure does have though is the ability to run aggregate queries. In AZT you can't even write a select count(*) from customer without loading each customer.
AZT also has a limit of 500 transactions per second per partition, and a limit of "several thousand" per second per account.
I've found that choosing what to use for your partition key (PK) and row key (RK) depends on how you're going to query the data. If you want to access each of these items individually, simply give each row its own partition key and a constant row key. This will mean that you have lots of partitions.
For the sake of example, suppose the rows you are inserting are orders, and the orders belong to a customer. If it is more common for you to list orders by customer, you would have PK = CustomerId, RK = OrderId. This means that to find orders for a customer you simply have to query on the partition key. To get a specific order you'd need to know the CustomerId and the OrderId. The more orders a customer has, the slower finding any particular order would be.
If you needed to access orders just by OrderId, then you would use PK = OrderId, RK = string.Empty and put the CustomerId in another property. You can still write a query that brings back all orders for a customer, but AZT doesn't support indexes other than on PartitionKey and RowKey, so a query that doesn't use a PartitionKey (and sometimes even one that does, depending on how you write it) will cause a table scan. With the number of records you're talking about, that would be very bad.
In all of the scenarios I've encountered, having lots of partitions doesn't seem to worry AZT too much.
Another way you can partition your data in AZT that is not often mentioned is to put the data in different tables. For example, you might want to create one table for each day. If you want to run a query for last week, run the same query against the 7 different tables. If you're prepared to do a bit of work on the client end you can even run them in parallel.
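A minimal sketch of the table-per-day approach: build the table names for the last N days and run the same query against each one in parallel. The `Orders` prefix, the `yyyymmdd` suffix, and the `query_table` callable are assumptions for illustration; `query_table` stands in for whatever client call actually queries one table.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def daily_tables(prefix, last_day, days):
    """Table names for the `days` days ending at `last_day`, newest first."""
    return [prefix + (last_day - timedelta(days=i)).strftime("%Y%m%d")
            for i in range(days)]


def query_last_days(query_table, prefix, last_day, days=7):
    """Run query_table(name) against each daily table in parallel and merge.

    query_table is a hypothetical callable returning a list of entities.
    """
    names = daily_tables(prefix, last_day, days)
    with ThreadPoolExecutor(max_workers=days) as pool:
        results = list(pool.map(query_table, names))
    return [entity for part in results for entity in part]


print(daily_tables("Orders", date(2024, 3, 2), 3))
```

The names use only letters and digits and start with a letter, which keeps them within the usual table naming rules.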
Azure SQL can easily ingest that much data and more. Here's a video I recorded months ago that shows a sample (available on GitHub) of one way you can do that.
https://www.youtube.com/watch?v=vVrqa0H_rQA
Here's the full repo:
https://github.com/Azure-Samples/streaming-at-scale/tree/master/eventhubs-streamanalytics-azuresql

Resources