How to check the size of a column in Cassandra

I want to calculate how much data we are storing in each column per row key.
I want to check the size of the columns and the number of keys/rows. Can anyone help me with how to do that?

cfstats will give you an estimated number of keys, and cfhistograms will tell you the number of columns/cells and the size of a row/partition (look for "Row/partition Size" and "Column/Cell Count").
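If it helps, here is a minimal way to run those two commands, wrapped in Python only so the arguments are concrete. The keyspace and table names are placeholders, nodetool is assumed to be on the PATH of a Cassandra node, and on newer Cassandra versions the commands are named tablestats and tablehistograms.

    import subprocess

    keyspace, table = "my_keyspace", "my_table"   # placeholders

    # Estimated number of keys, live data size, etc. for the table.
    subprocess.run(["nodetool", "cfstats", f"{keyspace}.{table}"], check=True)

    # Percentile histograms of partition size and cell count per partition.
    subprocess.run(["nodetool", "cfhistograms", keyspace, table], check=True)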

It depends a lot on the accuracy required. The histograms from JMX are estimates that can give you a rough idea of what the data looks like. A MapReduce job might be the best way to calculate exact column sizes.
What I would recommend is that when you insert your column, you also insert the size of the data you are storing into another CF/column. Then, depending on how you store it (which you would change based on how you want to query it), you could do things like find the largest columns and such.
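As a rough sketch of that suggestion, using the DataStax Python driver; the keyspace, table and column names below are hypothetical, not from your schema:

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])        # placeholder contact point
    session = cluster.connect("my_keyspace")

    def insert_with_size(row_key, column_name, value):
        # Write the actual data...
        session.execute(
            "INSERT INTO data_cf (row_key, column_name, value) VALUES (%s, %s, %s)",
            (row_key, column_name, value),
        )
        # ...and record its size in a companion table keyed the same way,
        # so you can later query for the largest values per row key.
        session.execute(
            "INSERT INTO data_sizes (row_key, column_name, size_bytes) "
            "VALUES (%s, %s, %s)",
            (row_key, column_name, len(value)),
        )

    insert_with_size("user:42", "avatar", b"\x89PNG...")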

Related

SSAS Tabular Model - Size Reduction

I am trying to apply some optimizations to reduce the overall size of my Tabular model.
In many articles, the suggested solution is to remove unnecessary columns and to split columns with high cardinality into two or more columns.
I focused on the second hint.
After the change, the size of my data is even bigger and I don't know why. I use VertiPaq Analyzer to check the metrics.
Before the change (table with 4,463,282,609 rows):
sar_Retail: cardinality 718,621, size 224,301,336 B
After the change:
sar_Retail_main: cardinality 1,663, size 89,264,048 B
sar_Retail_fraction: cardinality 10,001, size 302,518,208 B
As you can see, the two new columns together need 167,480,920 B more space than the original column.
I split the column with this statement:
,ROUND(sar_Retail, 0) sar_Retail_main
,ROUND(sar_Retail, 4) - ROUND(sar_Retail, 0) sar_Retail_fraction
It would be helpful if you could provide the .vpax outputs from VertiPaq Analyzer (before and after the column split).
I am not sure which data types you are using on the Tabular side, but if you only need to store the numbers with 4 decimal places of precision, you should definitely go with the Currency/Fixed Decimal type. It allows exactly 4 decimals and is internally stored as an integer multiplied by 10,000, which saves a lot of space compared to the float data type. You can try it without splitting the column and see the impact.
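To make the scaled-integer idea concrete, here is a small illustration in plain Python (a sketch of the concept only; the actual VertiPaq encoding is internal to the engine, and the function names are made up):

    # A value with at most 4 decimal places can be stored losslessly as an
    # integer scaled by 10,000, which is what Currency/Fixed Decimal does
    # conceptually.
    def to_fixed_decimal(value: float, scale: int = 10_000) -> int:
        """Encode a 4-decimal-place value as a scaled integer."""
        return round(value * scale)

    def from_fixed_decimal(stored: int, scale: int = 10_000) -> float:
        """Decode the scaled integer back to its decimal value."""
        return stored / scale

    encoded = to_fixed_decimal(1234.5678)        # -> 12345678
    print(encoded, from_fixed_decimal(encoded))  # -> 12345678 1234.5678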
I also recommend checking how run-length encoding works. Pre-sorting the data based on the least-changing columns can reduce the table size quite a lot, though of course it might slow down processing time.

Spark moving average

I am trying to implement moving average for a dataset containing a number of time series. Each column represents one parameter being measured, while one row contains all parameters measured in a second. So a row would look something like:
timestamp, parameter1, parameter2, ..., parameterN
I found a way to do something like that using window functions, but the following bugs me:
Partitioning Specification: controls which rows will be in the same partition with the given row. Also, the user might want to make sure all rows having the same value for the category column are collected to the same machine before ordering and calculating the frame. If no partitioning specification is given, then all data must be collected to a single machine.
The thing is, I don't have anything to partition by. So can I use this method to calculate a moving average without the risk of collecting all the data on a single machine? If not, what is a better way to do it?
Every nontrivial Spark job demands partitioning. There is just no way around it if you want your jobs to finish before the apocalypse. The question is simple: When it comes time to do the inevitable aggregation (in your case, an average), how can you partition your data in such a way as to minimize shuffle by grouping as much related data as possible on the same machine?
My experience with moving averages is with stocks. In that case it's easy; the partition would be on the stock ticker symbol. After all, the calculation of the 50-day moving average for Stock A has nothing to do with that for Stock B, so those data don't need to be on the same machine. The obvious partition makes this simpler than your situation, not to mention that it only requires one data point per day (probably the closing price of the stock at the end of trading) while you have one per second.
So I can only say that you need to consider adding a feature to your data set whose sole purpose is to serve as a partition key, even if it is irrelevant to what you're measuring. I would be surprised if there isn't one, but if not, then consider a time-based partition on days, for example.
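For illustration, here is a minimal PySpark sketch of that idea, using a day bucket derived from the timestamp purely as the partition key. The column names are the placeholders from the question, a 60-row trailing window stands in for whatever window you actually need, and rows near the start of each day will not see the previous day's data.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("moving-average").getOrCreate()
    df = spark.read.parquet("measurements.parquet")   # hypothetical input

    # Derive a day bucket from the epoch-seconds timestamp column.
    df = df.withColumn("day", F.to_date(F.from_unixtime("timestamp")))

    # Partition by day so no single machine has to hold all the data,
    # order by timestamp, and average over the last 60 rows (seconds).
    w = (Window.partitionBy("day")
               .orderBy("timestamp")
               .rowsBetween(-59, 0))

    result = df.withColumn("parameter1_ma60", F.avg("parameter1").over(w))
    result.select("timestamp", "parameter1", "parameter1_ma60").show(5)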

Cassandra Performance : Less rows with more columns vs more rows with less columns

We are evaluating whether we can migrate from SQL Server to Cassandra for OLAP. Given Cassandra's internal storage structure, we can have wide rows. We almost always access data by date, and we often need to access data within a date range since we have financial data. If we use the date as the partition key to support filtering by date, we end up having fewer rows with a huge number of columns.
Will it hamper performance if we have millions of columns for a single row key in the future, given that we process millions of transactions every day?
Do we need to change the access pattern to have more rows with fewer columns per row?
We need some performance insight to proceed in either direction.
Using wide rows is typically fine with Cassandra; there are, however, a few things to consider:
Ensure that you don't reach the 2 billion column limit in any case.
The whole wide row is stored on the same node, so it needs to fit on the disk. Also, if you have some dates that are accessed more frequently than other dates (e.g. today), you can create hotspots on the node that stores the data for that day.
Very wide rows can nevertheless affect performance. Aaron Morton from The Last Pickle has an interesting article about this: http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
It is somewhat old, but I believe that the concepts are still valid.
For a good table design decision, one needs to know all of the typical filter conditions. If you have any other fields you typically filter on as an exact match, you could add them to the partition key as well.
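As a sketch of what such a table could look like (using the DataStax Python driver; the keyspace, table and column names are placeholders, and account_id stands in for whatever second field you typically filter on exactly):

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])      # placeholder contact point
    session = cluster.connect("finance")  # hypothetical keyspace

    # A compound partition key (date + account) splits each day's data into
    # many smaller partitions instead of one huge row per day.
    session.execute("""
        CREATE TABLE IF NOT EXISTS transactions_by_day (
            tx_date    date,
            account_id text,
            tx_time    timestamp,
            tx_id      uuid,
            amount     decimal,
            PRIMARY KEY ((tx_date, account_id), tx_time, tx_id)
        ) WITH CLUSTERING ORDER BY (tx_time DESC, tx_id ASC)
    """)

    # A date-range style query then stays within one partition:
    rows = session.execute(
        "SELECT tx_time, amount FROM transactions_by_day "
        "WHERE tx_date = '2015-06-01' AND account_id = 'ACC-42' "
        "AND tx_time >= '2015-06-01 09:00:00'"
    )
    for row in rows:
        print(row.tx_time, row.amount)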

Azure Table Storage Delete where Row Key is Between two Row Key Values

Is there a good way to delete entities that are in the same partition given a row key range? It looks like the only way to do this would be to do a range lookup and then batch the deletes after looking them up. I'll know my range at the time the entities are deleted, so I'd rather skip the lookup.
I want to be able to delete things to keep my partitions from getting too big. As far as I know, a single partition cannot be scaled across multiple servers. Each partition is going to represent a type of message that a user sends. There will probably be fewer than 50 types. I need a way to show all the messages of each type that were sent (e.g. show recent messages of type 0 regardless of who sent them). This is why I plan to make the type the partition key. Since the types don't scale with the number of users/messages, though, I don't want to let each partition grow indefinitely.
Unfortunately, you need to know the precise partition keys and row keys in order to issue deletes. You do not need to retrieve entities from storage if you know the precise RowKeys, but you do need to have them in order to issue a batch delete. There is no magic "Delete from table where partitionkey = 10" command like there is in SQL.
However, consider breaking your data up into tables that represent archivable time units. For example, in AzureWatch we store all of the metric data in tables that represent one month of data, i.e. Metrics201401, Metrics201402, etc. Thus, when it comes time to archive, a full table is purged for a particular month.
The obvious downside of this approach is the need to "union" data from multiple tables if your queries span wide time ranges. However, if you keep your time ranges to a minimum, the number of unions will not be that big. Basically, this approach lets you use the table name as another partitioning opportunity.
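For what it's worth, with the current azure-data-tables Python package the lookup-then-batch-delete pattern looks roughly like this (connection string, table name and key values are placeholders; a single transaction is limited to 100 operations within one partition):

    from azure.data.tables import TableClient

    table = TableClient.from_connection_string(
        conn_str="<connection-string>", table_name="Messages"
    )

    partition_key = "0"                       # message type as partition key
    row_key_from, row_key_to = "2014-01-01", "2014-02-01"

    # Server-side filter on the known row key range; fetch only the keys.
    entities = table.query_entities(
        query_filter="PartitionKey eq @pk and RowKey ge @lo and RowKey lt @hi",
        parameters={"pk": partition_key, "lo": row_key_from, "hi": row_key_to},
        select=["PartitionKey", "RowKey"],
    )

    # Delete in batches of at most 100 operations per transaction.
    batch = []
    for entity in entities:
        batch.append(("delete", entity))
        if len(batch) == 100:
            table.submit_transaction(batch)
            batch = []
    if batch:
        table.submit_transaction(batch)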

How to find the size of a row in a table in a database (Sybase)

I wanted to know the command to find the size of a row in a table in my database. Say I have a database db and a table table.
How can I find the size of a row in that table (including all the columns)?
For displaying the expected row size for a table, see here.
In ASE, I use a combination of sp_estspace and sp_spaceused in a procedure for total data/index sizing that I wrote myself. You can grab the source for each inside sybsystemprocs.
sp_estspace tends to overestimate (by a lot) the size of the data (not indexes), and sp_spaceused divided by rowcount tends not to be a good indicator of how big a row could potentially be today (imagine you added 20 nullable columns yesterday and decide to use them; spaceused is unchanged).
reasonable expected data size = ((spaceused / rowcount) + estspace(1)) / 2
I haven't done any analysis of either command's accuracy for indexes, but I would imagine (spaceused / rowcount) would be very accurate for forward-looking items.
It's not perfect by ANY means, but it's been a fairly reliable estimate for my purposes. I wouldn't write any code that would break if it exceeded any of these estimates, though.
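As a plain illustration of the averaging heuristic above (the numbers are made up; the inputs are assumed to come from sp_spaceused and from an sp_estspace run for a single row, in whatever unit you take those figures in):

    def expected_row_size(spaceused_data, rowcount, estspace_one_row):
        """Average the observed per-row size with the worst-case estimate."""
        observed = spaceused_data / rowcount   # tends to undershoot
        estimated = estspace_one_row           # tends to overshoot
        return (observed + estimated) / 2

    # Example: 500,000 KB of data over 1,000,000 rows, with sp_estspace
    # predicting 1.8 KB for a fully populated row.
    print(expected_row_size(500_000, 1_000_000, 1.8))   # -> 1.15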
