Is there a way to get the size of a blob value in Cassandra with CQL? Better yet, is there a way to get the average size of a blob column, especially with a condition?
Thanks!
I don't know of a way to do that in CQL. I assume you are interested in the original uncompressed size of the blob rather than the compressed size within Cassandra. I'd suggest adding an integer field to the table and storing the size of the blob in it when you originally save the blob.
If you use that integer field as a clustering column, then you can do a range query on it to get the rows whose blobs fall within a certain size range. To get the average size of the blobs in a range, you can use CQL to retrieve the size column and then use Java/Python/etc. to calculate the average of the returned values.
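For example, here is a minimal sketch with the Python driver, assuming a hypothetical blobs_by_bucket table where the size is recorded at write time and used as a clustering column (all keyspace/table/column names below are made up):

    from uuid import uuid4
    from cassandra.cluster import Cluster

    # Hypothetical keyspace/table/column names -- adjust to your schema.
    session = Cluster(["127.0.0.1"]).connect("my_ks")

    session.execute("""
        CREATE TABLE IF NOT EXISTS blobs_by_bucket (
            bucket    text,
            blob_size int,   -- original size, recorded when the blob is written
            id        uuid,
            data      blob,
            PRIMARY KEY (bucket, blob_size, id)
        )
    """)

    def save_blob(bucket, payload):
        # Store the original (uncompressed) size next to the blob.
        session.execute(
            "INSERT INTO blobs_by_bucket (bucket, blob_size, id, data) "
            "VALUES (%s, %s, %s, %s)",
            (bucket, len(payload), uuid4(), payload),
        )

    def average_blob_size(bucket, min_size, max_size):
        # Range query on the clustering column; the average is computed client-side.
        rows = session.execute(
            "SELECT blob_size FROM blobs_by_bucket "
            "WHERE bucket = %s AND blob_size >= %s AND blob_size <= %s",
            (bucket, min_size, max_size),
        )
        sizes = [r.blob_size for r in rows]
        return sum(sizes) / len(sizes) if sizes else None

Note that the range query only works within a single partition (bucket here), since a range restriction on a clustering column requires the partition key to be restricted as well.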
Related
Can I do a logical query on a blob column field in a Cassandra query?
For example, I have a file inside a blob field containing purchase amount : 500$, and I want to search for and fetch the results where the purchase amount is greater than 500$.
Is there a way I can do this logical search inside my blob?
No, it's not possible out of the box. For Cassandra, the blob type is just a sequence of bytes. You can potentially use user-defined functions to extract the necessary data, but it could be tricky from a performance standpoint.
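If you do go down the UDF route, here is a minimal sketch, assuming user-defined functions are enabled in cassandra.yaml (enable_user_defined_functions: true), that the blob actually holds UTF-8 text, and with a made-up keyspace and function name. It simply decodes the blob so the client can parse and filter the text:

    -- Decode a blob as UTF-8 text; duplicate() avoids disturbing the input buffer.
    CREATE OR REPLACE FUNCTION my_ks.blob_as_text (doc blob)
        RETURNS NULL ON NULL INPUT
        RETURNS text
        LANGUAGE java
        AS 'return java.nio.charset.StandardCharsets.UTF_8.decode(doc.duplicate()).toString();';

Note that a function like this can only appear in the SELECT list, not in a WHERE clause, so the "greater than 500$" comparison would still have to happen client-side after extracting the value.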
P.S. I feel that Cassandra may not be the right product for you if you need to search by substring or something like this. In Cassandra you need to model your data based on your queries, and then select column types, etc.
I'm inspecting the output of sstabledump to gain a better understanding of the Cassandra data model, and I have some questions.
From the output of sstabledump it seems that:
a table is a list of partitions (split by partition key)
a partition is a list of rows (split according to clustering key)
a row is a map of key-value pairs, where the keys come from a predefined list
Question 1: For each partition, as well as for each row inside a partition, there is a position key. What does this value correspond to? Physical storage details? And how exactly?
Question 2: Each row inside each partition has a type: row key-value pair. Could this type be anything else? If yes, what? If not:
why have a value that is always the same?
why is Cassandra classified as wide-column and other similar terms? It looks more like a two-level row storage.
The partition's token is the Murmur3 hash of whatever you assigned as the partition key. Consistent hashing is used with that hash to determine which node in the cluster the partition belongs to, along with its replicas. Within each partition, data is sorted by clustering key, and then by cell name within the row. This structure is used so that redundant things like timestamps, when a row is inserted all at once, are only written once, as a vint delta sequence from the partition-level values, to save space.
On disk the partitions are sorted in order of this hashed key. The position value just refers to where the partition or row is located in the sstable's data file (a decompressed byte offset). The type can also identify that spot as a static block, which sits at the beginning of a partition to hold any static cells, or as a range tombstone marker (start or end bound). Note that sstabledump sometimes repeats values in the JSON for readability even if they are not physically written on disk (e.g. repeated timestamps).
You can have many of these rows inside a partition. A common data model for time series, for example, is to use the timestamp as the clustering key, which makes very wide partitions with millions of rows. Pre-3.0, the data storage was closer to Bigtable's design: it was essentially a Map<byte[], SortedMap<byte[], Cell>> where the Comparator of the sorted map was changed based on the schema. It did not differentiate rows and columns within a partition, which led to massive amounts of redundant data, so it was redesigned to fit the query language better.
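For illustration, here is an abridged sketch of what sstabledump output can look like (the key, clustering value, and cell below are invented), showing the partition-level and row-level position offsets and the type field:

    [
      {
        "partition" : {
          "key" : [ "sensor-1" ],
          "position" : 0
        },
        "rows" : [
          {
            "type" : "row",
            "position" : 31,
            "clustering" : [ "2019-01-01 00:00:00.000Z" ],
            "liveness_info" : { "tstamp" : "2019-01-01T00:00:00.000001Z" },
            "cells" : [
              { "name" : "reading", "value" : 23.5 }
            ]
          }
        ]
      }
    ]

As described above, entries other than regular rows (the static block at the start of a partition, or range tombstone bounds) show up with a different type value.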
Some more references:
Explanation of motivation of 3.0 change by DataStax here
Blog post by TLP has a good detailed explanation of the new disk format
CASSANDRA-8099
I need to store coordinates in Azure and had intended to use Table Storage.
My idea was to be able to query for a subset of coordinates based on two bounding coordinates, e.g.:
So my query (I think) would be, give me all the points where
The latitude is less than 53.360238 and greater than 53.344204
The Longitude is greater than -6.276734 and less than -6.250122
I had originally thought about saving them as:
PartitionKey, RowKey
"16.775833,-3.009444", "Timbuktu"
...
But I realised I would end up with thousands of partitions. I assumed that this would be really bad for querying, as I would have to touch many partitions, possibly on different partition servers.
Also, I'm not sure how it would work given that a partition/row key query is a string comparison.
I was wondering if there was a better way to store the points, for example I was thinking something like:
PartitionKey, RowKey, Title
16.775833,-3.009444, "Timbuktu"
...
This makes the query easier but doesn't solve the unique partition problem, e.g.:
Get all entities where the PartitionKey is less than X and greater than Y, AND where the RowKey is greater than A and smaller than B.
Is there a more efficient way to do this, perhaps by saving the whole number of the latitude as the partition key and the remainder in the RowKey?
PartitionKey, RowKey, Title
16, 775833^-3.009444, "Timbuktu"
...
Any advice is appreciated!
My suggestion would be to use DocumentDb to store this kind of unstructured data; you can easily write SQL-like queries over more than one field.
Table Storage is built more for key-value lookups only.
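For example, here is a rough sketch using the Python SDK for Azure Cosmos DB (the successor of DocumentDB); the endpoint, key, database, container, and property names are all made up:

    from azure.cosmos import CosmosClient

    # Hypothetical endpoint/key and database/container names.
    client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
    container = client.get_database_client("geo").get_container_client("points")

    # SQL-like query over both coordinates at once.
    query = (
        "SELECT c.title, c.latitude, c.longitude FROM c "
        "WHERE c.latitude > @minLat AND c.latitude < @maxLat "
        "AND c.longitude > @minLon AND c.longitude < @maxLon"
    )
    points = container.query_items(
        query=query,
        parameters=[
            {"name": "@minLat", "value": 53.344204},
            {"name": "@maxLat", "value": 53.360238},
            {"name": "@minLon", "value": -6.276734},
            {"name": "@maxLon", "value": -6.250122},
        ],
        enable_cross_partition_query=True,
    )
    for point in points:
        print(point)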
I read that Apache Cassandra's supported maximum size of a row is 64 KB, but I need to save a record with a size of 560 KB. Is that possible?
Yes, it is possible if you store the data in a column value instead of a column key.
In Cassandra, the 64 KB limitation applies only to column keys, which determine the ordering of data within a partition. For column values, the size limitation is 2 GB.
This page describes the difference between clustering columns (aka column keys) and regular columns (aka column values).
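As an illustration, in a hypothetical table like the one below, a 560 KB payload is fine in the blob column, while the key columns remain subject to the 64 KB limit:

    CREATE TABLE my_ks.files (
        id      uuid,
        name    text,   -- clustering column ("column key"): subject to the 64 KB limit
        content blob,   -- regular column ("column value"): can hold a 560 KB payload
        PRIMARY KEY (id, name)
    );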
I want to calculate how much data we are storing in each column per row key.
I want to check the size of the columns and the number of keys/rows. Can anyone help me with how to do that?
cfstats will give you an estimated number of keys, and cfhistograms will tell you the number of columns/cells and the size of a row/partition (look for "Row/Partition Size" and "Column/Cell Count").
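For example (the keyspace and table names here are placeholders; on newer releases the same commands are named tablestats and tablehistograms):

    nodetool cfstats my_keyspace.my_table
    nodetool cfhistograms my_keyspace my_table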
It depends a lot on the accuracy required. The histograms from JMX are estimates that can give you a rough idea of what the data looks like. A MapReduce job might be the best way to calculate exact column sizes.
What I would recommend is that when you insert your column, you also insert the size of the data you are storing into another column family/column. Then, depending on how you store it (which you would change based on how you want to query it), you could do things like find the largest columns and such.