I have about 28GB of Data-In for a little over 13.5 million rows stored in Windows Azure Table Storage.
6 Columns, all ints except 1 decimal and 1 datetime.
Partition Key is about 10 characters long.
RowKey is a guid.
This is a sanity check: does this seem about right?
The SQL database I migrated the data from has WAY more data and is only 4.9 GB.
Is there a way to condense the size? I don't suspect renaming properties will put a huge dent in this.
Note: this was only a sampling of data to estimate costs for the long haul.
Well... something doesn't seem to add up right.
Each property is a key/value pair, so include property names in your calculations.
The data itself is probably around 75-100 bytes including property names averaging 10 characters apiece. The 4 ints equate to 16 bytes, the decimal (double?) 8 bytes, and the timestamp 8 bytes. So let's just round up to 100 bytes per entity.
At 13.5 million entities, you'd have 100 bytes * 13.5 million, or about 1.35 GB.
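If it helps the sanity check, here is a back-of-the-envelope sketch based on the entity-size formula the Azure Storage team has published for billing; the property names below are made up (roughly 10 characters each) and the constants should be treated as approximate. Counting keys and property names at 2 bytes per character pushes the estimate closer to ~300 bytes per entity, which is still nowhere near the ~2 KB per entity that 28 GB over 13.5 million rows implies.

    # Back-of-the-envelope estimate using the published Azure Table billing
    # formula; property names are made up and constants are approximate.
    def estimate_entity_bytes(partition_key, row_key, properties):
        # 4 bytes overhead + 2 bytes per character of PartitionKey + RowKey
        size = 4 + 2 * (len(partition_key) + len(row_key))
        for name, value_size in properties.items():
            # each property: 8 bytes overhead + 2 bytes per name character + value size
            size += 8 + 2 * len(name) + value_size
        return size

    # 4 ints (4 B each), 1 double (8 B), 1 datetime (8 B)
    props = {"MeterOne": 4, "MeterTwo": 4, "MeterThree": 4, "MeterFour": 4,
             "RetailAmt": 8, "ReadingTime": 8}
    per_entity = estimate_entity_bytes("P" * 10, "x" * 36, props)  # guid RowKey ~36 chars
    print(per_entity, "bytes per entity")
    print(per_entity * 13_500_000 / 1024 ** 3, "GiB for 13.5M entities")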
Your numbers are approx. an order of magnitude larger (about 2,000 bytes per entity). Even accounting for bulk from serialization, I don't see how you're getting such a large size. Just curious: how did you compute the current table size? And... have you done multiple tests, resulting in more data from previous runs? Are you measuring just the table size, or the total storage used in the storage account? If the latter, there may be other tables (such as diagnostics) also consuming space.
Renaming properties in the entities that are persisted should have some impact on size. Unfortunately, that will only apply to data saved in the future; existing data does not change just because you've renamed the properties.
I want to use Cassandra as a feature store to store precomputed BERT embeddings.
Each row would consist of roughly 800 floating-point values (e.g. -0.18294132). Should I store all 800 in one large string column or in 800 separate columns?
The read pattern is simple: on read we would want to read every value in a row. I'm not sure which would be better for serialization speed.
Having everything as a separate column will be quite inefficient: each value will have its own metadata (writetime, for example) that adds significant overhead (at least 8 bytes per value). Storing the data as a string will also not be very efficient and will add complexity on the application side.
I would suggest storing the data as a frozen list of ints/bigints or doubles/floats, depending on your requirements. Something like:
create table ks.bert(
rowid int primary key,
data frozen<list<int>>
);
In this case, the whole list will be effectively serialized as a binary blob, occupying just one cell.
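For what it's worth, a minimal write/read sketch with the DataStax Python driver (cassandra-driver) could look like the following. It assumes the column is declared frozen<list<float>> for float embeddings (the example above uses list<int>; pick whichever element type fits your data), and the contact point and row id are placeholders.

    # Minimal write/read sketch with cassandra-driver; assumes the data column
    # is frozen<list<float>> (use list<int> etc. to match your element type).
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("ks")   # placeholder contact point

    insert = session.prepare("INSERT INTO bert (rowid, data) VALUES (?, ?)")
    select = session.prepare("SELECT data FROM bert WHERE rowid = ?")

    embedding = [0.1] * 800                    # placeholder 800-dim embedding
    session.execute(insert, (42, embedding))   # the Python list maps to the CQL list

    row = session.execute(select, (42,)).one()
    print(len(row.data))                       # -> 800, deserialized in one read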
I'm trying to apply some optimizations to reduce the overall size of my Tabular model.
In many articles, the recommended approach is to remove unnecessary columns and to split columns with high cardinality into two or more columns.
I focused on the second hint.
After the change, the size of my data is even bigger and I don't know why. I used VertiPaq Analyzer to check the metrics.
Before the change (table with 4,463,282,609 rows):
sar_Retail: cardinality 718,621, size 224,301,336 B
After the change:
sar_Retail_main: cardinality 1,663, size 89,264,048 B
sar_Retail_fraction: cardinality 10,001, size 302,518,208 B
As you can see, the two new columns together need more space than the original column (167,480,920 B more).
I split the column with this statement:
,ROUND(sar_Retail, 0) sar_Retail_main
,ROUND(sar_Retail, 4) - ROUND(sar_Retail, 0) sar_Retail_fraction
It would be helpful if you could provide the .vpax outputs from VertiPaq Analyzer (before and after the column split).
I am not sure which data types you are using on the Tabular side, but if you only need 4 decimal places of precision you should definitely go with the Currency/Fixed decimal type. It allows exactly 4 decimals and is internally stored as an integer multiplied by 10,000. It saves a lot of space compared to the float data type. You can try it without splitting the column and see the impact.
I also recommend checking how run-length encoding works. Pre-sorting the data on the least-changing columns can reduce the table size quite a lot. Of course, it might slow down processing time.
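To make those two points concrete, here is a small illustrative sketch in plain Python with invented sample values (it only demonstrates the ideas, not Tabular internals): the fixed-decimal representation is essentially the value scaled by 10,000 and kept as an integer, and run-length encoding gets cheaper when the data is pre-sorted because equal neighbouring values collapse into longer runs.

    # Illustration only (invented values), not Tabular internals:
    # 1) Currency/Fixed decimal ~ value * 10,000 kept as an integer
    # 2) run-length encoding stores fewer, longer runs on pre-sorted data
    from itertools import groupby

    retail = [12.3456, 12.3456, 7.0001, 12.3456, 7.0001, 7.0001]

    fixed = [round(v * 10_000) for v in retail]    # e.g. 12.3456 -> 123456

    def run_count(values):
        # number of runs an RLE encoder would have to store
        return sum(1 for _ in groupby(values))

    print(run_count(fixed))          # arrival order  -> 4 runs
    print(run_count(sorted(fixed)))  # pre-sorted     -> 2 runs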
I have a dataset which has close to 2 billion rows in parquet format which spans in 200 files. It occupies 17.4GB on S3. This dataset has close to 45% of duplicate rows. I deduplicated the dataset using 'distinct' function in Spark, and wrote it to a different location on S3.
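For reference, the dedup step boils down to something like this PySpark sketch (the S3 paths and app name are placeholders, not the real ones):

    # PySpark sketch of the dedup step; S3 paths and app name are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dedup").getOrCreate()

    df = spark.read.parquet("s3://bucket/input/")    # ~2B rows across 200 files
    deduped = df.distinct()                          # drops the ~45% duplicate rows;
                                                     # note: distinct() shuffles, so the
                                                     # output row order and file layout change
    deduped.write.parquet("s3://bucket/deduped/")    # written to a different location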
I expected the storage to be reduced by roughly half. Instead, the deduplicated data took 34.4 GB (double the size of the data with duplicates).
I checked the metadata of the two datasets and found a difference in the column encodings between the duplicate and the deduplicated data.
[Screenshot: difference in column encodings]
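In case it helps anyone reproduce the check, one way to dump the per-column-chunk encodings is with pyarrow (the screenshot above may well have been produced with different tooling; the file path and column index below are placeholders):

    # One way to dump per-column-chunk encodings with pyarrow; the local file
    # path and the column index are placeholders.
    import pyarrow.parquet as pq

    meta = pq.ParquetFile("part-00000.parquet").metadata
    for rg in range(meta.num_row_groups):
        col = meta.row_group(rg).column(0)            # first column as an example
        print(rg, col.path_in_schema, col.encodings)  # e.g. ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')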
I want to understand how to get the expected behavior of reducing the storage size.
Having said that, I have a few further questions:
I also want to understand whether this anomaly affects performance in any way. In my process, I have to apply a lot of filters on these columns and use the distinct function while persisting the filtered data.
I have seen in a few Parquet blogs online that a column has a single encoding. In this case I see more than one encoding per column. Is this normal?
I am in the process of learning Cassandra as an alternative to SQL databases for one of the projects I am working on, which involves big data.
For the purpose of learning, I've been watching the videos offered by DataStax, more specifically DS220 which covers modeling data in Cassandra.
While watching one of the videos in the course series I was introduced to the concept of splitting partitions to manage partition size.
My current understanding is that Cassandra has a hard limit of 2 billion cells per partition, but a suggested maximum of a couple hundred MB per partition.
I'm currently dealing with large amounts of real-time financial data that I must store (time series), meaning I can easily fill out GBs worth of data in a day.
The video course talks about introducing an additional partition key in order to split a partition, with the purpose of keeping each partition within the recommended size.
The video pointed to using either a time-based key or an arbitrary "bucket" key that gets incremented once a manageable number of rows has been reached.
With that in mind, this led me to the following problem: given that partition keys are only used as equality criteria (i.e. they point to the partition where the records live), how do I find all the records that end up being spread across multiple partitions without having to specify either the bucket or the timestamp key?
For example, I may receive 1M records in a single day, which would likely go over the 100-500 MB partition guideline, so I wouldn't be able to partition on a per-date basis. That means my daily data would be broken down into hourly partitions, or alternatively into "bucketed" partitions (for balanced partition sizes). Either way, all my daily data would be spread across multiple partition splits.
Given this scenario, how do I go about querying for all records for a given day? (Additional clustering keys could include a symbol I want the results for, or I may simply want all the records for that specific day.)
Any help would be greatly appreciated.
Thank you.
Basically this comes down to choosing the right resolution for your data. I would say the first step is to determine what best fits your data. For the sake of example, let's take 1 hour as a good bucket size; the question is then how to fetch all records for a particular date.
Your application logic will be slightly more complicated, since you are trading simplicity for the ability to store large amounts of data in a distributed fashion. You take the date you need, issue 24 queries in a loop (one per hourly bucket), and glue the data together at the application level. However, the glued result can be huge (I do not know your presentation or export requirements, so this may pull 1M rows into memory).
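A rough sketch of that loop with the DataStax Python driver, assuming a hypothetical schema bucketed by ((trade_date, hour)) with trade_date stored as text; the keyspace, table and column names are all invented for illustration:

    # Rough sketch of the 24-query loop with cassandra-driver; keyspace, table
    # and column names are invented, assuming a partition key of
    # ((trade_date, hour)) with trade_date stored as text ('YYYY-MM-DD').
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("market")
    select = session.prepare(
        "SELECT * FROM ticks WHERE trade_date = ? AND hour = ?")

    day_rows = []
    for hour in range(24):                     # one query per hourly bucket
        rows = session.execute(select, ("2015-10-29", hour))
        day_rows.extend(rows)                  # glue at the application level

    # the 24 requests could also be issued concurrently with execute_async
    print(len(day_rows))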
Another idea is to have a simple lookup table keyed by date, whose values are the partition keys that hold financial data for that date. When you read, you go to the lookup table first to get the keys, and then to the partitions holding the results. You can also store a counter of values per partition key so you know how much data to expect.
All in all, it is best to figure out some natural bucket in your data set and add it to the date (organization, zip code, etc.), and you can combine that with the additional lookup-table trick. This approach can be used for the symbol you mentioned: have symbols as partition keys, cluster by date, and store as values the partitions holding results for that date. Then, when you query for symbol # on 29-10-2015 and see that partitions A, D and Z have results, you go to those partitions, fetch the financial data from them, and glue it together at the application level.
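A sketch of that lookup-table idea, again with the Python driver and invented names; it reuses the hypothetical ticks table from the sketch above and treats its hourly buckets as the "partitions having results" stored in the index:

    # Sketch of the lookup-table idea; all names are invented. For each
    # (symbol, date) you keep the buckets that actually hold results, then
    # fan out only to those partitions of the data table.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("market")
    session.execute("""
        CREATE TABLE IF NOT EXISTS symbol_index (
            symbol     text,
            trade_date text,
            buckets    set<int>,
            PRIMARY KEY (symbol, trade_date)
        )""")

    # read path: lookup first, then hit only the partitions that have data
    idx = session.execute(
        "SELECT buckets FROM symbol_index WHERE symbol = %s AND trade_date = %s",
        ("#", "2015-10-29")).one()
    for bucket in (idx.buckets if idx and idx.buckets else []):
        rows = session.execute(
            "SELECT * FROM ticks WHERE trade_date = %s AND hour = %s",
            ("2015-10-29", bucket))
        # ...glue the rows together at the application level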
We've got a Windows Azure Table Storage system where various entity types report values during the day, using the following partition and row key scheme:
PartitionKey: entityType-Date
RowKey: entityId
There are about 4,000-5,000 entities. There are 6 entity types, roughly evenly distributed, so around 800-ish each.
Each row records the values for an entity for that particular day. This is currently JSON serialized.
The data is quite verbose.
We will periodically want to look back at the values in these partitions over a month or two months depending on what our website users want to look at.
We are having a problem: if we want to query a month of data for one entity, we have to query 31 partition keys by entityId.
This is very slow initially but after the first call the result is cached.
Unfortunately the nature of the site is that there will be a varying number of different queries so it's unlikely the data will benefit much from caching.
We could obviously make the partitions bigger, e.g. a whole week of data, and expand the row keys to entityId and date.
What other options are open to me, or is it simply the case that Windows Azure tables suffer fairly high latency?
Some options include
Make the 31 queries in parallel
Make a single query on a partition key range, that is:
PartitionKey >= entityType-StartDate and PartitionKey <= entityType-EndDate and RowKey = entityId.
It is possible that, depending on your data, this query may have less latency than your current approach; a sketch of it is below.
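Here is a sketch of that range query using the current azure-data-tables Python package (rather than the SDK of that era); the connection string, table name, entity type and ids are placeholders, and it assumes the date part of the PartitionKey sorts lexically (e.g. yyyy-MM-dd):

    # Sketch of the partition-key-range query with the azure-data-tables
    # package; connection string, table name, entity type and ids are
    # placeholders, and the PartitionKey date is assumed to sort lexically.
    from azure.data.tables import TableClient

    table = TableClient.from_connection_string("<connection-string>", "dailyvalues")

    flt = "PartitionKey ge @start and PartitionKey le @end and RowKey eq @id"
    entities = table.query_entities(
        query_filter=flt,
        parameters={
            "start": "meter-2023-01-01",   # entityType-StartDate
            "end":   "meter-2023-01-31",   # entityType-EndDate
            "id":    "12345",              # entityId
        },
    )
    for e in entities:
        print(e["PartitionKey"], e["RowKey"])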