How to find the size of a row in a table in a database (Sybase) - sap-ase

I wanted to know the command to find the size of a row in a table in my database. Say I have a database db and a table table.
How can I find the size of a row in that table (including all the columns)?

For displaying the expected row size for a table, see here.

In ASE, I use a combination of sp_estspace and sp_spaceused that I wrote myself for total data/index sizing. You can grab the source for each inside sybsystemprocs.
sp_estspace tends to overestimate the size of the data (not the indexes) by a lot, while sp_spaceused divided by the row count tends not to be a good indicator of how big a row could potentially be today (imagine you added 20 nullable columns yesterday and decided to start using them: spaceused is unchanged).
reasonable expected data size = ((spaceused / rowcount) + estspace(1)) / 2
I haven't done any analysis of the accuracy of either command for indexes, but I would imagine (spaceused / rowcount) would be very accurate for forward-looking estimates.
It's not perfect by ANY means, but it's been a fairly reliable estimate for my purposes. I wouldn't write any code that would break if a row exceeded any of these estimates, though.
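To make the formula concrete, here is a small Python sketch with hypothetical figures standing in for the sp_spaceused and sp_estspace(1) outputs:

```python
# A quick illustration of the blended estimate above, with made-up numbers
# standing in for the sp_spaceused and sp_estspace(1) figures.
def expected_row_size(spaceused_bytes, rowcount, estspace_row_bytes):
    """Average the observed bytes-per-row with sp_estspace's single-row estimate."""
    observed = spaceused_bytes / rowcount
    return (observed + estspace_row_bytes) / 2

# Hypothetical inputs: 512 MB of data pages, 2 million rows, a 400-byte estimate.
print(expected_row_size(512 * 1024 * 1024, 2_000_000, 400))  # ~334 bytes per row
```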

Related

Spark moving average

I am trying to implement moving average for a dataset containing a number of time series. Each column represents one parameter being measured, while one row contains all parameters measured in a second. So a row would look something like:
timestamp, parameter1, parameter2, ..., parameterN
I found a way to do something like that using window functions, but the following bugs me:
Partitioning Specification: controls which rows will be in the same partition with the given row. Also, the user might want to make sure all rows having the same value for the category column are collected to the same machine before ordering and calculating the frame. If no partitioning specification is given, then all data must be collected to a single machine.
The thing is, I don't have anything to partition by. So can I use this method to calculate moving average without the risk of collecting all the data on a single machine? If not, what is a better way to do it?
Every nontrivial Spark job demands partitioning. There is just no way around it if you want your jobs to finish before the apocalypse. The question is simple: When it comes time to do the inevitable aggregation (in your case, an average), how can you partition your data in such a way as to minimize shuffle by grouping as much related data as possible on the same machine?
My experience with moving averages is with stocks. In that case it's easy; the partition would be on the stock ticker symbol. After all, the calculation of the 50-Day Moving Average for Stock A has nothing to do with that for Stock B, so those data don't need to be on the same machine. The obvious partition makes this simpler than your situation--not to mention that it only requires one data point (probably) per day (the closing price of the stock at the end of trading) while you have one per second.
So I can only say that you need to consider adding a feature to your data set whose sole purpose is to serve as a partition key, even if it is irrelevant to what you're measuring. I would be surprised if there isn't one, but if not, then consider a time-based partition on days, for example.
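To make that last suggestion concrete, here is a minimal PySpark sketch, assuming a CSV input with a timestamp column and numeric parameter columns (the file name, column names, and 60-row window are my assumptions, not from the question): the calendar day is derived as the partition key and the moving average is computed over a row-based window within each day. The tradeoff is that the average resets at day boundaries.

```python
# A minimal sketch: derive a day column as the partition key, then compute a
# moving average over a row-based window. File and column names are assumptions.
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("moving-average-sketch").getOrCreate()

df = spark.read.csv("measurements.csv", header=True, inferSchema=True)

# Partition by calendar day so each day's rows are grouped on the same machine.
df = df.withColumn("day", F.to_date(F.col("timestamp")))

# 60-second moving average: the current row plus the 59 preceding rows in the day.
w = Window.partitionBy("day").orderBy("timestamp").rowsBetween(-59, 0)

result = df.withColumn("parameter1_ma", F.avg("parameter1").over(w))
result.select("timestamp", "parameter1", "parameter1_ma").show()
```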

Cassandra Performance : Less rows with more columns vs more rows with less columns

We are evaluating whether we can migrate from SQL Server to Cassandra for OLAP. As per the internal storage structure, we can have wide rows. We almost always need to access data by date, and we often need to access data within a date range, as we have financial data. If we use the date as the partition key to support filtering by date, we end up having fewer rows with a huge number of columns.
Will it hamper performance if we have millions of columns for a single row key in the future, as we process millions of transactions every day?
Do we need to make some changes to the access pattern to have more rows with fewer columns per row?
We need some performance insight to proceed in either direction.
Using wide rows is typically fine with Cassandra; there are, however, a few things to consider:
Ensure that you don't reach the 2 billion column limit in any case.
The whole wide row is stored on the same node, so it needs to fit on disk. Also, if some dates are accessed more frequently than others (e.g. today), you can create hotspots on the node that stores the data for that day.
Very wide rows can affect performance, however. Aaron Morton from The Last Pickle has an interesting article about this: http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
It is somewhat old, but I believe that the concepts are still valid.
For a good table design decision, one needs to know all the typical filter conditions. If you have any other fields you typically filter on as an exact match, you could add them to the partition key as well.
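As a rough sketch of that idea (the keyspace, table, and column names below are hypothetical, not from the question), a composite partition key that combines the date with an exact-match field such as an account id splits each day's data into narrower partitions while keeping timestamp range queries within a single partition:

```python
# A hedged sketch of a composite partition key: (trade_date, account_id) as the
# partition key, with the timestamp as the clustering column. All names are
# hypothetical; "finance" is an assumed keyspace on a local node.
from datetime import date, datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("finance")

session.execute("""
    CREATE TABLE IF NOT EXISTS transactions (
        trade_date date,
        account_id text,
        ts timestamp,
        amount decimal,
        PRIMARY KEY ((trade_date, account_id), ts)
    )
""")

# A date-range query stays inside one partition: one day, one account.
rows = session.execute(
    "SELECT ts, amount FROM transactions "
    "WHERE trade_date = %s AND account_id = %s AND ts >= %s AND ts < %s",
    (date(2016, 1, 15), "acct-42",
     datetime(2016, 1, 15, 9, 0), datetime(2016, 1, 15, 17, 0)),
)
for row in rows:
    print(row.ts, row.amount)
```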

"Capped collections" in Cassandra

Cassandra doesn't have capped collections (or row size limits), but one way of simulating them is to use an offline MapReduce job to clean up extra entries. Would it be better to have a second table that stores row counts for primary keys in another table? The downside is that you have to scan through the entire row_count table, since counters aren't indexable. Or would it be faster to just scan over the backing table with the real data?
Or is there another technique I should look into?
Edit: I found this: "Columns count vs counter column performance". Counting rows goes over all the data, so I'm leaning away from that.
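For reference, a minimal sketch of the counter-table idea raised above, using the Python driver; the keyspace and table names are hypothetical, and since counters can't be indexed, finding oversized rows still means scanning this table:

```python
# A hypothetical row_counts table: one counter per row key in the backing table.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("myks")  # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS row_counts (
        row_key text PRIMARY KEY,
        entries counter
    )
""")

# Bump the counter whenever an entry is written to the backing table; a periodic
# job can then scan row_counts and trim rows whose count exceeds the cap.
session.execute(
    "UPDATE row_counts SET entries = entries + 1 WHERE row_key = %s",
    ("some-row-key",),
)
```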

How to check the size of the column in cassandra

I want to calculate how much data we are storing in each column per row key.
I want to check the size of the columns and the number of keys/rows. Can anyone help me with how to do that?
cfstats will give you an estimated number of keys, and cfhistograms will tell you the number of columns/cells and the size of a row/partition (look for "Row/partition Size" and "Column/Cell Count").
It depends a lot on the accuracy required. The histograms from JMX are estimates that can give you a rough idea of what the data looks like. A MapReduce job might be the best way to calculate exact column sizes.
What I would recommend is that when you insert your column, you also insert the size of the data you are storing into another CF/column. Then, depending on how you store it (which you would change based on how you want to query it), you could do things like find the largest columns and such.
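A rough sketch of that recommendation with the Python driver; the keyspace, the backing data table, and the column_sizes table below are all assumptions:

```python
# Record the size of each stored value in a separate table at write time so the
# largest columns can be found later. All names here are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("myks")  # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS column_sizes (
        row_key text,
        column_name text,
        size_bytes int,
        PRIMARY KEY (row_key, column_name)
    )
""")

def insert_with_size(row_key, column_name, value):
    """Write the value to an assumed data table, and its size to column_sizes."""
    session.execute(
        "INSERT INTO data (row_key, column_name, value) VALUES (%s, %s, %s)",
        (row_key, column_name, value),
    )
    session.execute(
        "INSERT INTO column_sizes (row_key, column_name, size_bytes) VALUES (%s, %s, %s)",
        (row_key, column_name, len(value.encode("utf-8"))),
    )
```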

MultiGet or multiple Get operations when paging

I have a wide column family used as a 'timeline' index, where column names are timestamps. In order to prevent hotspots, I shard the CF by month so that each month has its own row in the CF.
I query the CF for a slice range between two dates and limit the number of columns returned based on the page's records per page, say to 10.
The problem is that if my date range spans several months, I get 10 columns returned from each row, even if there are 10 matching columns in the first row that would already satisfy my paging requirement.
I can see the logic in this, but it strikes me as a real inefficiency if I have to retrieve redundant records from potentially multiple nodes when I only need the first 10 matching columns regardless of how many rows they span.
So my question is, am I better off doing a single Get operation on the first row, then another Get operation on the second row if my first call doesn't return 10 records, and so on until I have the required number of records (or hit the row limit), or should I just accept the redundancy and discard the unneeded records?
I would sample your queries and record how many rows you needed to fetch for each one in order to get your 10 results and build a histogram of those numbers. Then, based on the histogram, figure out how many rows you would need to fetch at once in order to complete, say, 90% of your lookups with only a single query to Cassandra. That's a good start, at least.
If you almost always need to fetch more than one row, consider splitting your timeline by larger chunks than a month. Or, if you want to take a more flexible approach, use different bucket sizes based on the traffic for each individual timeline: http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra (see the "Variable Time Bucket Sizes" section).
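A small sketch of that sampling idea: given made-up counts of how many month-rows each sampled query had to touch to collect its 10 results, pick the row count that would cover roughly 90% of lookups in a single multiget:

```python
# Made-up samples: number of month-rows each query needed to reach 10 results.
rows_needed_per_query = [1, 1, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1, 4, 1]

def rows_to_fetch(samples, target=0.90):
    """Smallest row count that covers at least `target` of the sampled queries."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(target * len(ordered)))
    return ordered[index]

print(rows_to_fetch(rows_needed_per_query))  # fetch this many rows per page up front
```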
