How to store mutlimedia files in Cassandra? - cassandra

I'm a newbie to Cassandra. I'm trying to store mutlimedia(photo, video, audio) files in Cassandra using blob. How is do it? there any other alternative to do the same?
Thanks in advance.

It really depends on how large the files are, but large objects and files are more easily stored in MongoDB. With Cassandra you could split the files up into chunks of a smaller size and make a file correspond to a row, with the chunks as column values.

How large are your files? Blobs in cassandra can be in theory 2gb large - but in the real world blobs should be less than a few mb for performance reasons.
You can of course chunk your files up into smaller pieces and reassemble them as needed (ideal while streaming data).
But you can and probably should go polyglot - c* for metadata and some object store as aws s3 or rados on ceph for selfhosting and put just the links to the bulk data to c*.

Related

Correct Parquet file size when storing in S3?

I've been reading few questions regarding this topic and also several forums, and in all of them they seem to be mentioning that each of resulting .parquet files coming out from Spark should be either 64MB or 1GB size, but still can't make my mind around which case scenarios belong to each of those file sizes and the reasons behind apart from HDFS splitting them in 64MB blocks.
My current testing scenario is the following.
dataset
.coalesce(n) # being 'n' 4 or 48 - reasons explained below.
.write
.mode(SaveMode.Append)
.partitionBy(CONSTANTS)
.option("basepath", outputPath)
.parquet(outputPath)
I'm currently handling a total of 2.5GB to 3GB of daily data, that will be split and saved into daily buckets per year. The reasons behind 'n' being 4 or 48 is just for testing purposes, as I know the size of my testing set in advance, I try to get a number as close to 64MB or 1GB as I can. I haven't implemented code to buffer the needed data until I get the exact size I need prior saving.
So my question here is...
Should I take the size that much into account if I'm not planning to use HDFS and merely store and retrieve data from S3?
And also, which should be the optimal size for daily datasets of around 10GB maximum if I'm planning to use HDFS to store my resulting .parquet files?
Any other optimization tip would be really appreciated!
You can control the split size of parquet files, provided you save them with a splittable compression like snappy. For the s3a connector, just set fs.s3a.block.size to a different number of bytes.
Smaller split size
More workers can work on a file simultaneously. Speedup if you have idle workers.
More startup overhead scheduling work, starting processing, committing tasks
Creates more files from the output, unless you repartition.
Small files vs large files
Small files:
you get that small split whether or not you want it.
even if you use unsplittable compression.
takes longer to list files. Listing directory trees on s3 is very slow
impossible to ask for larger block sizes than the file length
easier to save if your s3 client doesn't do incremental writes in blocks. (Hadoop 2.8+ does if you set spark.hadoop.fs.s3a.fast.upload true.
Personally, and this is opinion, and some benchmark driven -but not with your queries
Writing
save to larger files.
with snappy.
shallower+wider directory trees over deep and narrow
Reading
play with different block sizes; treat 32-64 MB as a minimum
Hadoop 3.1, use the zero-rename committers. Otherwise, switch to v2
if your FS connector supports this make sure random IO is turned on (hadoop-2.8 + spark.hadoop.fs.s3a.experimental.fadvise random
save to larger files via .repartion().
Keep an eye on how much data you are collecting, as it is very easy to run up large bills from storing lots of old data.
see also Improving Spark Performance with S3/ADLS/WASB

Spark output JSON vs Parquet file size discrepancy

new Spark user here. i wasn't able to find any information about filesize comparison between JSON and parquet output of the same dataFrame via Spark.
testing with a very small data set for now, doing a df.toJSON().collect() and then writing to disk creates a 15kb file. but doing a df.write.parquet creates 105 files at around 1.1kb each. why is the total file size so much larger with parquet in this case than with JSON?
thanks in advance
what you're doing with df.toJSON.collect is you get a single JSON from all your data (15kb in your case) and you save that to disk - this is not something scalable for situations you'd want to use Spark in any way.
For saving parquet you are using spark built-in function and it seems that for some reason you have 105 partitions (probably the result of the manipulation you did) so you get 105 files. Each of these files has the overhead of the file structure and probably stores 0,1 or 2 records. if you want to save a single file you should coalesce(1) before you save (again this just for the toy example you have) so you'd get 1 file. Note that it still might be larger due to the file format overhead (i.e. the overhead might still be larger than the compression benefit)
Conan, it is very hard to answer your question precisely without knowing the nature of the data (you don't even tell amount of row in your DataFrame). But let me speculate.
First. Text files containing JSON usually take more space on disk then parquet. At least when one store millions-billions rows. The reason for that is parquet is highly optimized column based storage format which uses a binary encoding to store your data
Second. I would guess that you have a very small dataframe with 105 partitions (and probably 105 rows). When you store something that small the disk footprint should not bother you but if it does you need to be aware that each parquet file has a relatively sizeable header describing the data you store.

Storing pdf files as Blobs in Cassandra table?

I have a task to create a metadata table for my timeseries cassandra db. This metadata table would like to store over 500 pdf files. Each pdf file comprises of 5-10 MB data.
I have thought of storing them as Blobs. Is cassandra able to do that?
Cassandra isn't a perfect for such blobs and at least datastax recommends to keep them smaller than 1MB for best performance.
But - just try for your self and do some testing. Problems arise when partitions become larger and there are updates in them so the coordinator has much work to do in joining them.
A simple way to go is, store your blob separate as uuid key-value pair in its own table and only store the uuid with your data. When the blob is updated - insert a new one with a new uuid and update your records. With this trick you never have different (and maybe large) versions of your blob and will not suffer that much from performance. I think I read that Walmart did this successfully with images that were partly about 10MB as well as smaller ones.
Just try it out - if you have Cassandra already.
If not you might have a look at Ceph or something similar - but that needs it's own deployment.
You can serialize the file and store them as blob. The cost is deserialization when reading the file back. There are many efficient serialization/deserialization libraries that do this efficiently. Another way is to do what #jasim waheed suggested. However, that will result in network io. So you can decide where you want to pay the cost.

Storing media files in Cassandra

I tried to store the audio/video files in the database.
Is cassandra able to do that ? if yes, how do we store the media files in cassandra.
How about storing the metadata and original audio files in cassandra
Yes, Cassandra is definitely able to store files in its database, as "blobs", strings of bytes.
However, it is not ideal for this use case:
First, you are limited in blob size. The hard limit is 2GB size, so large videos are out of the question. But worse, the documentation from Datastax (the commercial company behind Cassandra's development) suggests that even 1 MB (!) is too large - see https://docs.datastax.com/en/cql/3.1/cql/cql_reference/blob_r.html.
One of the reasons why huge blobs are a problem is that Cassandra offers no API for fetching parts of them - you need to read (and write) a blob in one CQL operation, which opens up all sorts of problems. So if you want to store large files in Cassandra, you'll probably want to split them up into many small blobs - not one large blob.
The next problem is that some of Cassandra's implementation is inefficient when the database contains files (even if split up to a bunch of smaller blobs). One of the problems is the compaction algorithm, which ends up copying all the data over and over (a logarithmic number of times) on disk; An implementation optimized for storing files would keep the file data and the metadata separately, and only "compact" the metadata. Unfortunately neither Cassandra nor Scylla implement such a file format yet.
All-in-all, you're probably better off storing your metadata in Cassandra but the actual file content in a different object-store implementation.

Does VoltDB compress data on disk?

I am curious whether VoltDB compresses the data on disk/at rest.
If it does, what is the algorithm used and are there options for 3rd party compression methods (e.g. a loss permitted proprietary video stream compression algorithm)?
VoltDB uses Snappy compression when writing Snapshots to disk. Snappy is an algorithm optimized for speed, but it still has pretty good compression. There aren't any options for configuring or customizing a different compression method.
Data stored in VoltDB (e.g. when you insert records) is stored 100% in RAM and is not compressed. There is a sizing worksheet built in to the web interface that can help estimate the RAM required based on the specific datatypes of the tables, and whatever indexes you may define.
One of the datatypes that is supported is VARBINARY which stores byte arrays, i.e. any binary data. You could store pre-compressed data in VARBINARY columns, or use a third-party java compression library within stored procedures to compress and decompress inputs. There is a maximum size limit of 1MB per column, and 2MB per record, however a procedure could store larger sized binary data by splitting it across multiple records. There is a maximum size of 50MB for the inputs to or the results from a stored procedure. You could potentially store and retrieve larger sized binary data using multiple transactions.
I saw you posted the same question in our forum, if you'd like to discuss more back and forth, that is the best place. We'd also like to talk to you about your solution, so if you like I can contact you at the email address from your VoltDB Forum account.

Resources