how would you tell if a table in Cassandra is active or not?
I have done some research and from what I see, it looks like you can only tell if a table is present or not in Cassandra.
But is there a way to tell if the table is active in Cassandra?
Alex is spot-on. Most enterprises use something like Prometheus or Grafana to visualize JMX metrics, and track reads or writes per second. That's an easy way to see if the table is active.
Otherwise, you can always run nodetool tablestats keyspace.table. In the last 10 lines or so of the output, look for:
Average live cells per slice (last five minutes):
Maximum live cells per slice (last five minutes):
If you have any current read activity, these will be something other than zero or NaN (not a number).
For example, you can query different table-level metrics via JMX to see if specific table is used or not. You will need to detect changes between observations, maybe based on LiveDiskSpaceUsed metric, or something like that.
Related
I have a table with around 4M of partitions and each partition contains 4 rows. So, the total data in table would be having 16M rows (wide columns). Since our table is a time series database, we only need the latest row or version of the partition_key. I can achieve my desired results through below query. However this will impact load on clusters and time consuming. Would like to see if we have any other best way to achieve this or this is the only way.
SELECT some_value FROM some_table PER PARTITION LIMIT 1;
Using PER PARTITION LIMIT won't have an impact on performance. In fact, it's efficient for achieving what you need from each partition since only the first row will be returned and it doesn't to iterate over the other rows in the partition. Cheers!
I am trying to implement moving average for a dataset containing a number of time series. Each column represents one parameter being measured, while one row contains all parameters measured in a second. So a row would look something like:
timestamp, parameter1, parameter2, ..., parameterN
I found a way to do something like that using window functions, but the following bugs me:
Partitioning Specification: controls which rows will be in the same partition with the given row. Also, the user might want to make sure all rows having the same value for the category column are collected to the same machine before ordering and calculating the frame. If no partitioning specification is given, then all data must be collected to a single machine.
The thing is, I don't have anything to partition by. So can I use this method to calculate moving average without the risk of collecting all the data on a single machine? If not, what is a better way to do it?
Every nontrivial Spark job demands partitioning. There is just no way around it if you want your jobs to finish before the apocalypse. The question is simple: When it comes time to do the inevitable aggregation (in your case, an average), how can you partition your data in such a way as to minimize shuffle by grouping as much related data as possible on the same machine?
My experience with moving averages is with stocks. In that case it's easy; the partition would be on the stock ticker symbol. After all, the calculation of the 50-Day Moving Average for Stock A has nothing to with that for Stock B, so those data don't need to be on the same machine. The obvious partition makes this simpler than your situation--not to mention that it only requires one data point (probably) per day (the closing price of the stock at the end of trading) while you have one per second.
So I can only say that you need to consider adding a feature to your data set whose sole purpose is to serve as a partition key even if it is irrelevant to what you're measuring. I would be surprised if there isn't one, but if not, then consider a time-based partition on days for example.
I am in the process of learning Cassandra as an alternative to SQL databases for one of the projects I am working for, that involves Big Data.
For the purpose of learning, I've been watching the videos offered by DataStax, more specifically DS220 which covers modeling data in Cassandra.
While watching one of the videos in the course series I was introduced to the concept of splitting partitions to manage partition size.
My current understanding is that Cassandra has a max logical capacity of 2B entries per partition, but a suggested max of a couple 100s MB per partition.
I'm currently dealing with large amounts of real-time financial data that I must store (time series), meaning I can easily fill out GBs worth of data in a day.
The video course talks about introducing an additional partition key in order to split a partition with the purpose or reducing the size per partition requirement.
The video pointed out to using either a time based key or an arbitrary "bucket" key that gets incremented when the number of manageable rows has been reached.
With that in mind, this led me to the following problem: given that partition keys are only used as equality criteria (ie. point to the partition to find records), how do I find all the records that end up being spread across multiple partitions without having to specify either the bucket or timestamp key?
For example, I may receive 1M records in a single day, which would likely go over the 100-500Mb partition limit, so I wouldn't be able to set a partition on a per date basis, that means that my daily data would be broken down into hourly partitions, or alternatively, into "bucketed" partitions (for balanced partition sizes). This means that all my daily data would be spread across multiple partitions splits.
Given this scenario, how do I go about querying for all records for a given day? (additional clustering keys could include a symbol for which I want to have the results for, or I want all the records for that specific day)
Any help would be greatly appreciated.
Thank you.
Basically this goes down to choosing right resolution for your data. I would say first step for you would be to determinate what is best fit for your data. Lets for sake of example take 1 hour as something that is good and question is how to fetch all records for particular date.
Your application logic will be slightly more complicated since you are trading simplicity for ability to store large amounts of data in distributed fashion. You take date which you need and issue 24 queries in a loop and glue data on application level. However when you glue that in can be huge (I do not know your presentation or export requirements so this can pull 1M to memory).
Other idea can be having one table as simple lookup table which has key of date and values of partition keys having financial data for that date. Than when you read you go first to lookup table to get keys and then to partitions having results. You can also store counter of values per partition key so you know what amount of data you expect.
All in all it is best to figure out some natural bucket in your data set and add it to date (organization, zip code etc.) and you can use trick with additional lookup table. This approach can be used for symbol you mentioned. You can have symbols as partition keys, clustering per date and values of partitions having results for that date as values. Than you query for symbol # on 29-10-2015 and you see partitions A, D and Z have results so you go to those partitions and get financial data from them and glue it together on application level.
We are evaluating if we can migrate from SQL SERVER to cassandra for OLAP. As per the internal storage structure we can have wide rows. We almost need to access data by the date. We often need to access data within date range as we have financial data. If we use date as Partition key to support filter by date,we end up having less row with huge number of columns.
Will it hamper performance if we have millions of columns for a single row key in future as we process millions of transactions every day.
Do we need to have some changes in the access pattern to have more rows with less number of columns per row.
Need some performance insight to proceed in either direction
Using wide rows is typically fine with Cassandra, there are however a few things to consider:
Ensure that you don't reach the 2 billion column limit in any case
The whole wide row is stored on the same node: it needs to fit on the disk. Also, if you have some dates that are accessed more frequently then other dates (e.g. today) then you can create hotspots on the node that stores the data for that day.
Very wide rows can affect performance however: Aaron Morton from The Last Pickle has an interesting article about this: http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
It is somewhat old, but I believe that the concepts are still valid.
For a good table design decision one needs to know all typical filter conditions. If you have any other fields you typically filter for as an exact match, you could add them to the partition key as well.
I want to calculate how much data we are storing on each column per row key.
I want to check the size of the column and number of keys/rows. Can any one help me how to do that?
cfstats will give you an estimated number of keys, and cfhistograms will tell you the number of columns/cells and size of a row/partition (Look for "Row/partition Size" and "Column/Cell Count")
Depends a lot on the accuracy required. The histograms from jmx are estimates that could give you a rough idea of what the data looks like. A map reduce job might be the best way to calculate exact column sizes.
What I would recommend is when you insert your column you also insert the size of the data you are storing in another CF/column. Then depending on how you store it (which you would change based on how you want to query it) you could do things like find the largest columns and such.