I'm trying to apply some optimizations to reduce the overall size of my Tabular model.
Many articles suggest that the best approach is to remove unnecessary columns and to split high-cardinality columns into two or more columns.
I focused on the second suggestion.
After the change, my data is even bigger and I don't know why. I used VertiPaq Analyzer to check the metrics.
Before the change (the table has 4463282609 rows):
sar_Retail: cardinality 718621, size 224301336 B
After the change:
sar_Retail_main: cardinality 1663, size 89264048 B
sar_Retail_fraction: cardinality 10001, size 302518208 B
As you can see, the two new columns together need 167480920 B more space than the original column.
I split the column with this statement:
,ROUND(sar_Retail, 0) sar_Retail_main
,ROUND(sar_Retail, 4) - ROUND(sar_Retail, 0) sar_Retail_fraction
It would be helpful if you could provide the .vpax outputs from VertiPaq Analyzer (before and after the column split).
I am not sure which data types you are using on the Tabular side, but if you need to store the numbers with only 4 decimal places of precision, you should definitely go with the Currency/Fixed Decimal type. It allows exactly 4 decimals and is internally stored as an integer multiplied by 10 000. It saves a lot of space compared to the float data type. You can try it without splitting the column and see the impact.
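If the model is loaded from a relational source (an assumption on my part), you can also make the 4-decimal precision explicit in the source query so that the Fixed Decimal type maps cleanly. A minimal sketch, with a hypothetical table name:

SELECT
    CAST(sar_Retail AS DECIMAL(19, 4)) AS sar_Retail  -- 4 decimals, matching Currency/Fixed Decimal
FROM dbo.Sales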
I also recommend checking how run-length encoding works. Pre-sorting the data by the least-changing columns can reduce the table size quite a lot, although it might of course slow down processing time.
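As an illustration of the pre-sorting idea (the column names other than sar_Retail are hypothetical), the source query could order the rows by the columns whose values repeat the most:

SELECT store_id, product_id, sale_date, sar_Retail
FROM dbo.Sales
ORDER BY store_id, product_id, sale_date  -- least-changing columns first, which favours run-length encoding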
Related
I want to use Cassandra as a feature store for precomputed BERT embeddings.
Each row would consist of roughly 800 numbers (e.g. -0.18294132). Should I store all 800 in one large string column or in 800 separate columns?
The read pattern is simple: on read we want every value in a row. I'm not sure which option would be better for serialization speed.
Having every value as a separate column will be quite inefficient: each value will have its own metadata (the writetime, for example) that adds significant overhead (at least 8 bytes per value). Storing the data as a string will also not be very efficient, and it adds complexity on the application side.
I would suggest storing the data as a frozen list of integers/longs or doubles/floats, depending on your requirements. Something like:
create table ks.bert(
rowid int primary key,
data frozen<list<int>>
);
In this case, the whole list will be effectively serialized as a binary blob, occupying just one cell.
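A quick usage sketch against that table (the values are made up; a real row would hold ~800 entries):

insert into ks.bert (rowid, data) values (1, [3, -18, 294, 132]);
select data from ks.bert where rowid = 1;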
I have a Spark DataFrame consisting of many double columns that are measurements, but I want a way of annotating each unique row by computing a hash of several other non-measurement columns. This hash results in garbled strings that are highly unique, and I've noticed my dataset size increases substantially when this column is present. How can I sort / lay out my data to decrease the overall dataset size?
I know that the Snappy compression used on my Parquet files performs best on runs of similar data, so I think a sort over the primary key could be useful, but I also can't coalesce() the entire dataset into a single file (it's hundreds of GB in total before the primary-key creation step).
My hashing function is SHA2(128) FYI.
If you have a column that can be computed from the other columns, then simply omit that column before compression, and reconstruct it after decompression.
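A minimal Spark SQL sketch of that idea, assuming a hypothetical view measurements with key columns key_a and key_b, measurement columns value_1 and value_2, and a derived hash column row_hash:

-- write the data without the derived hash column
CREATE TABLE measurements_compact
USING parquet
AS SELECT key_a, key_b, value_1, value_2
   FROM measurements;

-- reconstruct the hash on read (256-bit shown; Spark's sha2 supports 224/256/384/512)
SELECT m.*,
       sha2(concat_ws('|', m.key_a, m.key_b), 256) AS row_hash
FROM measurements_compact m;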
I want to calculate how much data we are storing in each column per row key.
I want to check the size of the columns and the number of keys/rows. Can anyone help me with how to do that?
cfstats will give you an estimated number of keys, and cfhistograms will tell you the number of columns/cells and the size of a row/partition (look for "Row/partition Size" and "Column/Cell Count").
It depends a lot on the accuracy required. The histograms from JMX are estimates that can give you a rough idea of what the data looks like. A MapReduce job might be the best way to calculate exact column sizes.
What I would recommend is that when you insert your column, you also insert the size of the data you are storing into another CF/column. Then, depending on how you store it (which you would change based on how you want to query it), you could do things like finding the largest columns.
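Expressed in CQL terms, a sketch of that idea might look like this (the keyspace, table, and column names are hypothetical):

create table ks.column_sizes (
    row_key text,
    column_name text,
    size_bytes bigint,
    primary key (row_key, column_name)
);

-- written alongside every insert into the data table
insert into ks.column_sizes (row_key, column_name, size_bytes)
values ('user:42', 'profile_blob', 18432);

-- sizes recorded for a given row key
select column_name, size_bytes from ks.column_sizes where row_key = 'user:42';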
I want to know the command to find the size of a row in a table in my database. Say I have a database db and a table table.
How can I find the size of a row in that table (including all the columns)?
For displaying the expected row size for a table, see here.
In ASE, I use a combination of sp_estspace and sp_spaceused that I wrote myself for total data/index sizing. You can grab the source for each inside sybsystemprocs.
sp_estspace tends to overestimate (by a lot) the size of the data (not the indexes), and sp_spaceused divided by the row count tends not to be a good indicator of how big a row could potentially be today (imagine you added 20 nullable columns yesterday and decide to use them today; spaceused is unchanged).
reasonable expected data size = ((spaceused / rowcount) + estspace(1) ) / 2
I haven't done any analysis of the accuracy of either command for indexes, but I would imagine (spaceused / rowcount) would be very accurate for forward-looking estimates.
It's not perfect by ANY means, but it's been a fairly reliable estimate for my purpose. I wouldn't write any code that would break if it exceeded any of these estimates, though.
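A minimal sketch of combining the two procedures (the table name and the numbers in the comments are made up for illustration):

exec sp_spaceused SalesDetail        -- note the data KB figure and the rowtotal
go
exec sp_estspace SalesDetail, 1      -- projected size for a single row
go
-- blended estimate: if sp_spaceused shows 500000 KB of data over 1000000 rows
-- (512 bytes/row) and sp_estspace projects 700 bytes/row, then
-- reasonable expected data size = (512 + 700) / 2 = 606 bytes per row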
I have a table which I think might be better off using different data types for many of the columns. I want to design some tests to determine the payoff in disk space if these columns are switched to better data types. How can I determine how much disk space a table is taking up in ASE 15.0?
1. Use sp_spaceused table_name, 1. That reports the table and each index separately. Dividing the data space used by the rowtotal gives you one value: the actual row length (not counting fragmentation, which depends on the lock scheme and activity).
2. Use sp_help table_name. That gives you another value: the intended or average row length. Using the info provided, do the simple arithmetic with the column lengths, then estimate what they will be after the datatype changes you intend. Note that variable-length columns require 4 additional bytes each, and a nullable column is stored as variable length.
3. Now create the new table (even temporarily) with the same columns but the new datatypes, and repeat (2). This will confirm your estimates (see the sketch below).
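A minimal sketch of step 3, with hypothetical column names and datatypes:

create table TrialTable (
    id      int             not null,
    amount  numeric(12, 4)  not null,   -- candidate replacement for a float column
    note    varchar(50)     null        -- nullable, so stored as variable length
)
go
exec sp_help TrialTable                 -- compare the reported row length with your estimate
go
drop table TrialTable
go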
sp_estspace has a different purpose.
1> sp_spaceused TableName
2> go
name rowtotal reserved data index_size unused
-------------------- ----------- --------------- --------------- --------------- ---------------
TableName 5530288 5975116 KB 5537552 KB 392292 KB 45272 KB
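As a worked example from that output: 5537552 KB of data divided by 5530288 rows is roughly 1 KB (about 1025 bytes) per row on average.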
I'm not aware of anything that will give you a breakdown by column, though. Using sp_help against the table does give you a list of all the columns and their lengths, which I think indicates the amount of storage each column could use.
There are methods of estimating table size using sp_estspace, but I've never tried these.