Cassandra: Issue with blob creation for large file

We are trying to load a file into a blob column in Cassandra. When we load files of 1-2 MB, it goes through fine. While loading a large file, say around 50 MB, we get the following error:
Cassandra failure during write query at consistency LOCAL_QUORUM (1 responses were required but only 0 replica responded, 1 failed)
It is a single node development DB. Any hints or support will be appreciated.

50 MB is pretty big for a cell. Although a little out of date, this is still accurate: http://cassandra.apache.org/doc/4.0/faq/#can-large-blob
There is no mechanism for streaming out of cells in Cassandra, so the cell's content needs to be serialized as a single response, in memory. You're probably hitting a limit or bug somewhere that's throwing an exception and causing the failed query (check Cassandra's system.log; there may be an exception in there that describes what's occurring better).
If you are using a CQL collection or a logged batch, there are additional, lower limits.
http://docs.datastax.com/en/cql/3.3/cql/cql_reference/refLimits.html
You can try chunking your blobs into parts. I'd actually recommend something like 64 KB chunks; on the client side, iterate through them and generate a stream (which also prevents loading the whole file into memory on your side).
CREATE TABLE exampleblob (
blobid text,
chunkid int,
data blob,
PRIMARY KEY (blobid, chunkid));
Then just SELECT * FROM exampleblob WHERE blobid = 'myblob'; and iterate through the results. Inserting gets more complex, though, since you need logic to split up your file; this can also be done in a streaming fashion and stay memory-efficient on your app side, as in the sketch below.
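A minimal client-side sketch of that chunking approach with the DataStax Python driver; the contact point, keyspace name and 64 KB chunk size are assumptions for illustration, not part of the original question:

from cassandra.cluster import Cluster

CHUNK_SIZE = 64 * 1024  # 64 KB chunks, as suggested above

cluster = Cluster(['127.0.0.1'])          # assumed contact point
session = cluster.connect('my_keyspace')  # assumed keyspace

insert = session.prepare(
    "INSERT INTO exampleblob (blobid, chunkid, data) VALUES (?, ?, ?)")

def store_file(blob_id, path):
    # Stream the file in fixed-size chunks so it is never fully in memory.
    with open(path, 'rb') as f:
        chunk_id = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            session.execute(insert, (blob_id, chunk_id, chunk))
            chunk_id += 1

def read_file(blob_id, out_path):
    # Clustering on chunkid returns the chunks in insertion order.
    rows = session.execute(
        "SELECT data FROM exampleblob WHERE blobid = %s", (blob_id,))
    with open(out_path, 'wb') as out:
        for row in rows:
            out.write(row.data)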
Another alternative is to just upload the blob to S3 or some distributed file store, using a hash of the file as the bucket/filename. In Cassandra, just store the filename as a reference to it.
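A rough sketch of that alternative using boto3, with a content hash as the object key; the bucket name and the Cassandra table in the comment are placeholders:

import hashlib
import boto3

s3 = boto3.client('s3')

def upload_and_reference(path, bucket='my-blob-bucket'):
    # Hash the file in chunks to get a stable, de-duplicating object key.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    key = h.hexdigest()
    s3.upload_file(path, bucket, key)  # streams from disk
    # In Cassandra, store only the reference, e.g.
    #   INSERT INTO files (file_id, s3_bucket, s3_key) VALUES (...)
    return key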

Related

Does using Parquet on S3 with EMR/Spark save bandwidth when using a subset of columns?

I have an EMR cluster running Spark. In the first step the CSV files are transformed into parquet.snappy format partitioned by the date column, so I am left with
s3://my-bucket/dataset/date=2020-12-20/part-0001.parquet.snappy
s3://my-bucket/dataset/date=2020-12-20/part-0002.parquet.snappy
s3://my-bucket/dataset/date=2020-12-20/part-0003.parquet.snappy
s3://my-bucket/dataset/date=2020-12-20/part-0004.parquet.snappy
the columns are
id,name,value
A subsequent job processes this data:
df = spark.read.parquet('s3://my-bucket/dataset')
df.createOrReplaceTempView('dataset')
spark.sql('''
select id,
       sum(value)
from dataset
where date = '2020-12-20'
group by 1
''')
so in the query I am not using the name column. From what I understand about Parquet, the chunks of data corresponding to the name column wouldn't be read from disk at all.
Question:
A) Are all the part-000x parts of the dataset actually downloaded from S3 to the Spark cluster, with only the required columns loaded into memory (no bandwidth is saved, but there is still the benefit of the columnar format when reading columns into memory)?
or
B) Can Spark somehow seek() into the files on S3 so that it only downloads the subsections of the parts that correspond to the required columns (bandwidth is saved)?
Trying to optimise object store reads for columnar data is a very interesting problem.
I can't speak for the EMR S3 connector, as I haven't seen its code.
But the general read plan for a parquet file on S3 through Spark is:
the file is opened (HEAD, maybe GET)
seek to a few bytes off EOF, read the magic file type (safety check) and the location of the full footer
seek to and read the footer, which says where the columns are in the file, the schema, etc.
Then it does the work: seek/bulk reads of the stripes containing the columns it wants to query/include.
Exactly which APIs are used depends on the library; parquet.jar does readFully(offset) with a block size of 2 MB. Back-to-back reads of 2 MB blocks are pretty common.
If there's any predicate pushdown, the stripe summary data after the stripe is read first; then either the stripe is skipped, or it is needed, in which case there's a backwards seek and a full read of the stripe.
When working with any of the object stores, while bandwidth is limited, dealing with seek "optimally" is actually the thing which keeps people busy. Do a full GET from 0 to EOF and, as soon as the parquet code does a seek(), you have to decide whether to read and discard the remaining bytes in the request, or abort the HTTPS connection, GET a new one starting at the new position, and take the hit of a new TLS negotiation.
Random IO seek policies say "do shorter GETs so we can recycle the connection for the next block"; that works best for Parquet and ORC, and is awful for .csv, .avro, etc. Then there's the question "do you want to GET the range plus some extra?", because of those common back-to-back reads. Or even: should you prefetch the next block?
The general consensus in the latest ASF connectors (S3A for S3 and ABFS for Azure Storage) is: start off sequential and switch to random IO on the first backwards seek, and add a way to change that default. ABFS also does some async prefetching of the next block, which can boost sequential reads and, if not needed, is not too expensive in terms of CPU, network and Azure billing costs.
Oh, and sometimes code does a seek(l1), seek(l2) a few times back to back, so you just remember that offset and only worry about issuing GET requests on the first read().
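To make the lazy-seek idea concrete, here is a toy reader in Python over boto3 ranged GETs. It only illustrates the concept (remember seeks, pay for a GET on the first read); it is not how the S3A connector is actually implemented, and the bucket/key are placeholders:

import boto3

class LazyRangeReader:
    """Remembers seek()s and only issues a ranged GET on the next read()."""

    def __init__(self, bucket, key):
        self.s3 = boto3.client('s3')
        self.bucket, self.key = bucket, key
        self.pos = 0

    def seek(self, offset):
        # Cheap: just remember the offset, no HTTP traffic yet.
        self.pos = offset

    def read(self, length):
        # One ranged GET for exactly the bytes the caller asked for.
        rng = 'bytes=%d-%d' % (self.pos, self.pos + length - 1)
        body = self.s3.get_object(Bucket=self.bucket, Key=self.key,
                                  Range=rng)['Body'].read()
        self.pos += len(body)
        return body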
Like I said: it's complicated. Lots of opportunities to tune things.
If you are using the s3a:// connector, call .toString() on the input stream and you get stats on how much data has been read, skipped, discarded, how many connections were aborted, etc.
Further reading: s3a connector input stream seek code

Spark: writing data to a place that is being read from without losing data

Please help me understand how I can write data to a place that is also being read from without any issue, using EMR and S3.
I need to read partitioned data, find old data, delete it, and write the new data back, and I'm thinking about 2 ways here:
Read all the data, apply a filter, and write the data back with save option SaveMode.Overwrite. I see one major issue here: before writing, it will delete the files in S3, so if the EMR cluster goes down for some reason after the deletion but before the write, all data will be lost. I can use dynamic partition overwrite, but in that situation I would still lose the data from one partition.
Same as above, but write to a temp directory, then delete the original and move everything from temp to the original location. But as this is S3 storage, there is no move operation, and all files will be copied, which can be a bit pricey (I'm going to work with 200 GB of data).
Is there any other way, or am I wrong about how Spark works?
You are not wrong. The process of deleting a record from a table on EMR/Hadoop is painful in the ways you describe and more. It gets messier with failed jobs, small files, partition swapping, slow metadata operations...
There are several formats and file protocols that add transactional capability on top of a table stored in S3. The open Delta Lake format (https://delta.io/) supports transactional deletes, updates, and merge/upsert, and does so very well. You can read and delete (say, for GDPR purposes) exactly as you're describing, and you'll have a transaction log to track what you've done.
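For example, a minimal sketch with the open-source delta-spark package, assuming the table has already been written in Delta format and that the path and new_df DataFrame are placeholders:

from delta.tables import DeltaTable

path = 's3://my-bucket/dataset_delta'   # placeholder Delta table path

# Transactional delete of the old records; readers keep seeing the previous
# snapshot until the commit lands in the transaction log.
DeltaTable.forPath(spark, path).delete("date = '2020-12-20'")

# Or overwrite just that partition in a single atomic commit
# (new_df is the replacement data for that date):
(new_df.write.format('delta')
    .mode('overwrite')
    .option('replaceWhere', "date = '2020-12-20'")
    .save(path))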
On point 2, as long as you have a reasonable number of files, your costs should be modest, with data charges at ~$23/TB/mo. However, if you end up with too many small files, the API costs of listing and fetching files can add up quickly. Managed Delta (from Databricks) will help speed up many of the operations on your tables through compaction, data caching, data skipping, and Z-ordering.
Disclaimer: I work for Databricks.

Storing pdf files as Blobs in Cassandra table?

I have a task to create a metadata table for my timeseries Cassandra DB. In this metadata table I would like to store over 500 PDF files. Each PDF file comprises 5-10 MB of data.
I have thought of storing them as blobs. Is Cassandra able to do that?
Cassandra isn't a perfect fit for such blobs, and at least DataStax recommends keeping them smaller than 1 MB for best performance.
But just try it for yourself and do some testing. Problems arise when partitions become large and there are updates in them, so the coordinator has a lot of work to do joining them.
A simple way to go is to store your blob separately, as a uuid/value pair in its own table, and only store the uuid with your data. When the blob is updated, insert a new one with a new uuid and update your records. With this trick you never have different (and maybe large) versions of your blob and won't suffer that much performance-wise. I think I read that Walmart did this successfully with images, some of which were around 10 MB as well as smaller ones.
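A sketch of that uuid-keyed blob table with the DataStax Python driver; the keyspace, table and column names are made up for the example:

import uuid
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')  # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS pdf_blobs (
        blob_id uuid PRIMARY KEY,
        data blob)""")

def store_pdf(pdf_bytes):
    # Blobs are effectively immutable: an updated PDF gets a brand-new uuid,
    # and the metadata row is updated to point at it.
    blob_id = uuid.uuid4()
    session.execute(
        "INSERT INTO pdf_blobs (blob_id, data) VALUES (%s, %s)",
        (blob_id, pdf_bytes))
    return blob_id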
Just try it out, if you already have Cassandra.
If not, you might have a look at Ceph or something similar, but that needs its own deployment.
You can serialize the file and store it as a blob. The cost is deserialization when reading the file back; there are many libraries that do serialization/deserialization efficiently. Another way is to do what @jasim waheed suggested. However, that will result in network IO, so you can decide where you want to pay the cost.

Cassandra: Storing and retrieving large sized values (50MB to 100 MB)

I want to store and retrieve values in Cassandra that range from 50 MB to 100 MB.
As per the documentation, Cassandra works well when the column value size is less than 10 MB. Refer here
My table is as below. Is there a different approach to this?
CREATE TABLE analysis (
prod_id text,
analyzed_time timestamp,
analysis text,
PRIMARY KEY (prod_id, analyzed_time)
) WITH CLUSTERING ORDER BY (analyzed_time DESC)
From my own experience: although in theory Cassandra can handle large blobs, in practice it may be really painful. On one of my past projects, we stored protobuf blobs in C* ranging from 3 KB to 100 KB, but there were some (~0.001%) of them with sizes up to 150 MB. This caused problems:
Write timeouts. By default C* has a 10s write timeout, which is really not enough for large blobs.
Read timeouts. The same issue with the read timeout, read repair, hinted handoff timeouts and so on. You have to debug all these possible failures and raise all these timeouts (see the sketch below). C* also has to read the whole heavy row from disk into RAM, which is slow.
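If you do go down that road, the client-side request timeout can be raised in the DataStax Python driver via an execution profile; the server-side timeouts in cassandra.yaml still have to be raised separately. A minimal sketch, with the contact point and 60s value chosen arbitrarily:

from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# Give large-blob requests 60s on the client side instead of the driver's
# default of 10s; server-side limits are configured in cassandra.yaml.
profile = ExecutionProfile(request_timeout=60)
cluster = Cluster(['127.0.0.1'],
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()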
I personally suggest not using C* for large blobs, as it's not very effective. There are alternatives:
Distributed filesystems like HDFS. Store a URL to the file in C* and the file contents in HDFS.
DSE (the commercial C* distro) has its own distributed FS called CFS, built on top of C*, which can handle large files well.
Rethink your schema in a way that gives you much lighter rows. But that really depends on your current task (and there's not enough information in the original question about it).
Large values can be problematic, as the coordinator needs to buffer each row on heap before returning it to the client to answer a query. There's no way to stream the analysis value.
Internally, Cassandra is also not optimized to handle such a use case very well, and you'll have to tweak a lot of settings to avoid the problems described by shutty.

Gridgain: Write timed out (socket was concurrently closed)

While trying to upload data to Gridgain using GridDataLoader, I'm getting
'Write timed out (socket was concurrently closed).'
I'm trying to load 10 million lines of data from a .csv file on a cluster with 13 nodes (16-core CPUs).
The structure of my loader is GridDataLoader<Key, Value>, where Key is a composite key. While using a primitive data type as the key there was no issue, but when I changed it to a composite key this error started appearing.
I guess this is because it takes up too much heap space when it tries to parse your CSV and create the entries. As a result, if you don't configure your heap size to be large enough, you are likely suffering from GC pauses: when GC kicks in, everything has to pause, and that's why you got this timeout error. I think it may help if you break that large CSV into smaller files and load them one by one.
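A small sketch of the split-and-load idea in plain Python, independent of the GridGain API; the chunk size and file naming are arbitrary:

import csv

def split_csv(src, lines_per_chunk=500000):
    # Write the big CSV out as numbered chunk files that can be loaded
    # one at a time, keeping heap pressure on the loader low.
    with open(src, newline='') as f:
        reader = csv.reader(f)
        chunk, part = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) >= lines_per_chunk:
                _write_chunk(src, part, chunk)
                chunk, part = [], part + 1
        if chunk:
            _write_chunk(src, part, chunk)

def _write_chunk(src, part, rows):
    with open('%s.part%04d.csv' % (src, part), 'w', newline='') as out:
        csv.writer(out).writerows(rows)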
