Is it possible to increase the size of cell in cassandra? - cassandra

I want to insert a 16MB image with blob type in Cassandra.
However, I noticed that the practical limit on blob size is less than 1 MB.
(The description of blob type is here.)
Except splitting the image into multiple 1MB, I'm wondering if it is possible to increase the size of the cell to handle my requirement.
Thanks a lot.

The 1Mb limit specified in the documentation is a recommendation, not a hard limit. And it's a good recommendation, because otherwise you can get problems with maintenance operations, like, repair, bootstrapping of the new nodes, etc. - I've seen cases (on older Cassandra) when people stored 1Mb blobs, and couldn't add the new data center because bootstrap failed. Nowadays, it shouldn't be a problem, but this recommendation still actual.
Usual recommendation is to store file content on the file system and store metadata, including the file path in Cassandra. By doing that, it's easier to host your images, especially if you're in the cloud - this will be more performant, and cheaper...

Related

ArangoDB Key/Value Model: value maximum size

With regard to the Key/Value model of ArangoDB, does anyone know the maximum size per Value? I have spent hours searching the Internet for this information but to no avail; you would think that this is a classified information. Thanks in advance.
The answer depends on different things, like the storage engine and whether you mean theoretical or practical limit.
In case of MMFiles, the maximum document size is determined by the startup option wal.logfile-size if wal.allow-oversize-entries is turned off. If it's on, then there's no immediate limit.
In case of RocksDB, it might be limited by some of the server startup options such as rocksdb.intermediate-commit-size, rocksdb.write-buffer-size, rocksdb.total-write-buffer-size or rocksdb.max-transaction-size.
When using arangoimport to import a 1GB JSON document, you will run into the default batch-size limit. You can increase it, but appears to max out at 805306368 bytes (0.75GB). The HTTP API seems to have the same limitation (/_api/cursor with bindVars).
What you should keep in mind: mutating the document is potentially a slow operation because of the append-only nature of the storage layer. In other words, a new copy of the document with a new revision number is persisted and the old revision will be compacted away some time later (I'm not familiar with all the technical details, but I think this is fair to say). For a 500MB document is seems to take a few seconds to update or copy it using RocksDB on a rather strong system. It's much better to have many but small documents.

Where does the idea of a 10MB partition size come from?

I'm doing some data modelling for time series data in Cassandra, and I've decided to implement buckets to regulate my partition sizes and maintain reasonable distribution on my cluster.
I decided to bucketise such that my partitions would not exceed a size of 10MB, as I've seen numerous sources that state this as an ideal partition size, but I can't find any information on why 10MB was chosen. On top of this I can't find anything from DataStax or Apache that mentions this soft 10MB limit at all.
Our data can be requested for large periods of time, meaning lots of partitions will be required to service 1 request if the partition sizes remain at 10MB. I'd rather increase the size of the partitions, and have fewer partitions required to service these requests.
Where does this idea of a 10MB partition size come from? Is it still relevant? What would be so bad if my partitions were 20MB in size? Or even 50MB?
With 10MB referenced in so many places, I feel like there must be something to it. Any information would be appreciated. Cheers.
I think that many of these advises are coming from old time, when support for wide partitions weren't very good - it was a lot of pressure on heap when we read data, etc.. Since Cassandra 3.0 the situation heavily improved, but it's still recommended to keep the size on the disk under 100Mb.
For example, DataStax planning guide says in section "Estimating partition size":
a good rule of thumb is to keep the maximum number of rows below 100,000 items and the disk size under 100 MB
In recent versions of Cassandra we can go beyond this recommendation, but it still not advised, although it heavily depends on the access patterns. You can find more information in the following blog post, and this video.
I have seen users with 60+Gb partitions - system still works, but the data distribution is not ideal, so nodes are becoming "hot", and performance may suffer.

Storing pdf files as Blobs in Cassandra table?

I have a task to create a metadata table for my timeseries cassandra db. This metadata table would like to store over 500 pdf files. Each pdf file comprises of 5-10 MB data.
I have thought of storing them as Blobs. Is cassandra able to do that?
Cassandra isn't a perfect for such blobs and at least datastax recommends to keep them smaller than 1MB for best performance.
But - just try for your self and do some testing. Problems arise when partitions become larger and there are updates in them so the coordinator has much work to do in joining them.
A simple way to go is, store your blob separate as uuid key-value pair in its own table and only store the uuid with your data. When the blob is updated - insert a new one with a new uuid and update your records. With this trick you never have different (and maybe large) versions of your blob and will not suffer that much from performance. I think I read that Walmart did this successfully with images that were partly about 10MB as well as smaller ones.
Just try it out - if you have Cassandra already.
If not you might have a look at Ceph or something similar - but that needs it's own deployment.
You can serialize the file and store them as blob. The cost is deserialization when reading the file back. There are many efficient serialization/deserialization libraries that do this efficiently. Another way is to do what #jasim waheed suggested. However, that will result in network io. So you can decide where you want to pay the cost.

Storing media files in Cassandra

I tried to store the audio/video files in the database.
Is cassandra able to do that ? if yes, how do we store the media files in cassandra.
How about storing the metadata and original audio files in cassandra
Yes, Cassandra is definitely able to store files in its database, as "blobs", strings of bytes.
However, it is not ideal for this use case:
First, you are limited in blob size. The hard limit is 2GB size, so large videos are out of the question. But worse, the documentation from Datastax (the commercial company behind Cassandra's development) suggests that even 1 MB (!) is too large - see https://docs.datastax.com/en/cql/3.1/cql/cql_reference/blob_r.html.
One of the reasons why huge blobs are a problem is that Cassandra offers no API for fetching parts of them - you need to read (and write) a blob in one CQL operation, which opens up all sorts of problems. So if you want to store large files in Cassandra, you'll probably want to split them up into many small blobs - not one large blob.
The next problem is that some of Cassandra's implementation is inefficient when the database contains files (even if split up to a bunch of smaller blobs). One of the problems is the compaction algorithm, which ends up copying all the data over and over (a logarithmic number of times) on disk; An implementation optimized for storing files would keep the file data and the metadata separately, and only "compact" the metadata. Unfortunately neither Cassandra nor Scylla implement such a file format yet.
All-in-all, you're probably better off storing your metadata in Cassandra but the actual file content in a different object-store implementation.

Cassandra maximum realistic blob size

I'm trying to evaluate a few distributed storage platforms and Cassandra is one of them.
Our requirement is to save files between 1MB and 50MB of size and according to Cassandra's documentation http://docs.datastax.com/en/cql/3.3/cql/cql_reference/blob_r.html:
The maximum theoretical size for a
blob is 2 GB. The practical limit on blob size, however, is less than
1 MB.
Does anyone have experience storing files in Cassandra as blobs? Any luck with it? Is the performance really bad with bigger file sizes?
Any other suggestion would also be appreciated!
Cassandra was not build for these type of job.
In Cassandra a single column value size can be: 2 GB ( 1 MB is recommended). So If you want to use use cassandra as object storage, split the big object into multiple small object and store them with object id as partition key and bucket id as clustering key.
It is best to use Distributed Object Storage System like OpenStack Object Storage ("Swift")
The OpenStack Object Store project, known as Swift, offers cloud storage software so that you can store and retrieve lots of data with a simple API. It's built for scale and optimized for durability, availability, and concurrency across the entire data set. Swift is ideal for storing unstructured data that can grow without bound.

Resources