Fetch size for a large result set - Cassandra

I'm using the DataStax Java driver and have a partition with around 750,000 rows that I'd like to iterate over; I currently hit a ReadTimeoutException. Will setting Statement#setFetchSize(2000) be all I need to do to avoid the timeout (assuming my client has the memory, that is)? Or will I need to do the paging myself manually?

Assuming you are using the driver with protocol v2 or higher, this should be all you need. Automatic paging will occur under the hood, returning up to 2000 rows at a time.
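For reference, a minimal sketch with the Java driver 3.x; the contact point, keyspace, table, and key are placeholders, not anything from your schema:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PagedIteration {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_ks")) {          // keyspace is a placeholder

            Statement stmt = new SimpleStatement(
                    "SELECT * FROM items WHERE pk = ?", "some-key") // table/key are placeholders
                    .setFetchSize(2000);                            // rows per page

            ResultSet rs = session.execute(stmt);
            for (Row row : rs) {
                // the driver transparently fetches the next page of 2000
                // rows whenever the current one is exhausted
                process(row);
            }
        }
    }

    private static void process(Row row) { /* application logic */ }
}
```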

Related

Increase request timeout for CQL from NiFi

I am using the QueryCassandra processor in NiFi to fetch data from Cassandra, but my query is hitting a timeout exception. I want to increase the request timeout when running the CQL query from the processor. Is there a way to do that, or will I have to write a custom processor?
Most probably you're getting the exception because you're querying on a non-partition-key column. In that case the query is fanned out to all nodes and has to scan all available data, which is very slow on a big data set.
In Cassandra, queries are fast only when they are restricted by (at least) the partition key. If you need to search on a non-partition column, you need to re-model your tables to match your queries (see the sketch below). I recommend taking the DS220 course on DataStax Academy for a better understanding of how Cassandra works.
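To make the re-modeling idea concrete, here is a hedged sketch with the DataStax Java driver 3.x; the keyspace, table, and column names are all hypothetical. The point is that the table's partition key is chosen to match the query, so the read touches a single partition:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class QueryByPartitionKey {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_ks")) {   // keyspace is a placeholder

            // Denormalize: create a table whose partition key matches the query.
            session.execute(
                "CREATE TABLE IF NOT EXISTS events_by_user ("
              + " user_id text, event_time timestamp, payload blob,"
              + " PRIMARY KEY ((user_id), event_time))");

            // Restricted by the partition key, this read goes straight to the
            // replicas that own the 'alice' partition instead of scanning all nodes.
            ResultSet rs = session.execute(
                "SELECT * FROM events_by_user WHERE user_id = ?", "alice");
            rs.forEach(row -> System.out.println(row.getTimestamp("event_time")));
        }
    }
}
```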
As @Alex Ott said, querying on a non-partition key is not recommended. If you still want to do so and want to increase the timeout for the query, just set the Max Wait Time property to whatever timeout you want.
EDIT:
tl;dr: Apache's timeout wrapper doesn't really let you use the timeout option.
Now that you've mentioned that this is a DataStax exception and not java.util.concurrent.TimeoutException, I can tell you that I've looked into the QueryCassandra processor's source code: Apache simply wrapped the query call in a Future to implement the timeout, instead of using the DataStax driver's built-in timeout option. As a result, the driver keeps its default timeout, and that default cannot be changed through the processor. It should be reported to Apache as a bug.
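For illustration, a minimal sketch of that wrapping pattern in plain JDK code (not NiFi's actual source): a Future-based timeout only bounds how long the caller waits; the wrapped driver call still runs with its own, unchanged, default timeout.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FutureTimeoutWrapper {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(() -> slowQuery());
        try {
            // The wrapper stops WAITING after 5 seconds...
            System.out.println(future.get(5, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            // ...but the wrapped call is still governed by the driver's own
            // default read timeout, which this pattern never touches.
            future.cancel(true);
        } finally {
            executor.shutdown();
        }
    }

    private static String slowQuery() throws InterruptedException {
        Thread.sleep(10_000); // stand-in for a slow driver call
        return "rows";
    }
}
```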

Cassandra vs Cassandra+Ignite

(Single-node cluster.) I've got a table with 2 columns, one of type text and the other a blob. I'm using DataStax's C++ driver to perform read/write requests against Cassandra.
The blob stores a C++ structure (size: 7 KB).
Since I was getting lower-than-desired throughput with Cassandra alone, I tried adding Ignite on top of Cassandra, hoping for a significant performance improvement now that the data would be read from RAM instead of hard disks.
However, it turned out that after adding Ignite, performance dropped even further (by roughly 50%!).
Read Throughput when using only Cassandra: 21000 rows/second.
Read Throughput with Cassandra + Ignite: 9000 rows/second.
Since I'm storing a C++ structure in Cassandra's blob, the Ignite API serializes/de-serializes the data on every write/read. Is this the reason for the drop in performance (consider the size of the structure, i.e. 7 KB), or is this drop not expected at all and maybe something's wrong in the configuration?
Cassandra: 3.11.2
RHEL: 6.5
The Ignite configuration is the same as given here.
I got a significant improvement in Ignite+Cassandra throughput when I used serialization in raw mode: throughput went up from 9,000 rows/second to 23,000 rows/second. But it's still not significantly better than Cassandra alone, so I'm still hoping to find some more tweaks that will improve this further.
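For readers wondering what "raw mode" looks like, here is a hedged sketch using Ignite's Java binary API (the class and field names are made up, and the original benchmark uses the C++ client, which has an analogous facility). Raw mode writes the payload without per-field metadata:

```java
import org.apache.ignite.binary.BinaryObjectException;
import org.apache.ignite.binary.BinaryRawReader;
import org.apache.ignite.binary.BinaryRawWriter;
import org.apache.ignite.binary.BinaryReader;
import org.apache.ignite.binary.BinaryWriter;
import org.apache.ignite.binary.Binarylizable;

// Holds the 7 KB structure as one opaque byte array.
public class BlobHolder implements Binarylizable {
    private byte[] data;

    public BlobHolder() { }                       // needed for deserialization

    public BlobHolder(byte[] data) { this.data = data; }

    @Override
    public void writeBinary(BinaryWriter writer) throws BinaryObjectException {
        BinaryRawWriter raw = writer.rawWriter(); // raw mode: no per-field metadata
        raw.writeByteArray(data);
    }

    @Override
    public void readBinary(BinaryReader reader) throws BinaryObjectException {
        BinaryRawReader raw = reader.rawReader();
        data = raw.readByteArray();
    }
}
```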
I've added some more details about the configurations and client code on github.
Looks like you do one get per key in this benchmark for Ignite, and you didn't invoke loadCache before it. In this case, on each get, Ignite will go to Cassandra to fetch the value and only then store it in the cache. So I'd recommend invoking loadCache before benchmarking, or at least testing gets on the same keys, to give Ignite a chance to hold the keys in the cache. If you think you already have all the data in the caches, please share the code where you write data to Ignite too.
Also, you invoke grid.GetCache in each thread. It won't take a lot of time, but you should definitely avoid such things inside the benchmark, where you're already measuring time (see the sketch below).
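A minimal sketch of that advice in Java (the config path, cache name, and keys are placeholders; the cache is assumed to be configured with a Cassandra CacheStore behind it):

```java
import java.util.Arrays;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class WarmedUpBenchmark {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start("ignite-config.xml")) {       // config path is a placeholder
            // Obtain the cache handle ONCE, outside the timed section.
            IgniteCache<String, byte[]> cache = ignite.cache("myCache");  // cache name is a placeholder

            // Warm the cache from the Cassandra store before measuring,
            // so the timed gets are served from memory.
            cache.loadCache(null); // null filter = load everything

            long start = System.nanoTime();
            for (String key : Arrays.asList("k1", "k2", "k3")) {          // placeholder keys
                byte[] value = cache.get(key); // now an in-memory read
            }
            System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1_000_000);
        }
    }
}
```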

Hazelcast 3.6.1 Aggregation

I am using Hazelcast 3.6.1, set up as server/client. A Map lives on the server (single node) and holds about 4 GB of data. My program creates a client and then needs to look up some data that is very small in size (about 30 MB). At first I fetched entries from the map and looped through all of them to search for the data of interest; before I knew it the process size was 4 GB (each get on the map lazily loaded the entry into client memory until all the data was loaded!). So I discovered aggregation, which I was under the impression ran entirely server-side, with only the part I was interested in returned to the client, but the client process still grows to 350 MB!
Is aggregation solely done on the server?
Thanks
First of all, you should upgrade to a Hazelcast 3.8.x version, since the new aggregation system is much faster. Apart from that it depends on what you're trying to aggregate, but if you're doing real aggregations like sum, min, or similar, aggregation is the way to go. The documentation for 3.8.x fast-aggregations is available here: http://docs.hazelcast.org/docs/3.8.3/manual/html-single/index.html#fast-aggregations
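A hedged sketch of what a 3.8.x fast-aggregation call looks like from a client (the map name and the Employee type are placeholders):

```java
import java.io.Serializable;

import com.hazelcast.aggregation.Aggregators;
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class ServerSideSum {
    public static class Employee implements Serializable {
        public long salary;
        public Employee(long salary) { this.salary = salary; }
    }

    public static void main(String[] args) {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<String, Employee> map = client.getMap("employees"); // map/attribute are placeholders

        // The sum is computed on the members that own the data; only the
        // single aggregated value travels back to the client.
        Long total = map.aggregate(Aggregators.longSum("salary"));
        System.out.println("total salary: " + total);

        client.shutdown();
    }
}
```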
After some testing, it appears that the collator portion of the aggregator runs on the client.

Cassandra pagination: the difference between driver paging and CQL LIMIT

I am reading about driver pagination here, but CQL also supports a LIMIT clause. I wonder what the difference is between these two.
Pagination is about how much of your result you work with at a time.
WHERE and LIMIT are about what is in your result.
Imagine you request all rows where X < 100. This may match a million different rows. If the client or the server materialized all of them at once, it would cause a lot of resource pressure. To avoid this, the driver is capable of asking for just a few rows at a time. This allows the client and the server to work with a stream of items rather than allocating space for everything up front.
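A hedged side-by-side sketch with the DataStax Java driver 3.x (keyspace, table, and key names are placeholders):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class LimitVsFetchSize {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_ks")) {        // placeholders throughout

            // LIMIT bounds WHAT the query returns: at most 100 rows, ever.
            session.execute("SELECT * FROM readings WHERE sensor_id = ? LIMIT 100", "s1");

            // Fetch size bounds HOW the result travels: every matching row,
            // but transported 500 at a time as you iterate.
            Statement paged = new SimpleStatement(
                    "SELECT * FROM readings WHERE sensor_id = ?", "s1")
                    .setFetchSize(500);
            for (Row row : session.execute(paged)) {
                // next pages are fetched transparently during iteration
            }
        }
    }
}
```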

Any way to do an asynchronous put with Astyanax for Cassandra, or Hector?

I would like to stream some files in and out of Cassandra, since we already use it, rather than setting up a full Hadoop distributed filesystem. Is there an asynchronous put in Astyanax or Hector that I can provide a completion callback for, so I can avoid the ~1 ms network delay per call as I write 1,000 entries? (The entries are split between a few rows and columns as well, so the file is streamed to a few servers in parallel and all the responses/callbacks come back when streaming is done.) Does Hector or Astyanax support this?
It looks like Astyanax supports a query callback, so I think I can use the primary keys to stream the file back with Astyanax?
thanks,
Dean
Cassandra doesn't actually support streaming via the Thrift API. Furthermore, breaking the file up into a single mutation batch that spreads data across multiple rows and columns can be very dangerous. It could blow the heap on Cassandra, or you may run into the 1 MB socket write buffer limit, which under certain error cases can cause your Thrift connection to hang indefinitely (although I think this may be fixed in the latest version of Cassandra).
The new chunked object store recipe in Astyanax (https://github.com/Netflix/astyanax/wiki/Chunked-Object-Store) builds on our experience at Netflix with storing large objects in Cassandra and provides a simple API that handles all the chunking and parallelization for you. It may still make thousands of calls to Cassandra (depending on your file size and chunk size), but it also handles all the retries and parallelization for you. The same goes for reading files: the API reads the chunks and reassembles them, in order, into an OutputStream.
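Usage looks roughly like the following sketch, adapted from the recipe's wiki linked above (the column family name and file paths are placeholders):

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.recipes.storage.CassandraChunkedStorageProvider;
import com.netflix.astyanax.recipes.storage.ChunkedStorage;
import com.netflix.astyanax.recipes.storage.ChunkedStorageProvider;
import com.netflix.astyanax.recipes.storage.ObjectMetadata;

public class ChunkedFileStore {
    public static void storeAndRead(Keyspace keyspace) throws Exception {
        ChunkedStorageProvider provider =
            new CassandraChunkedStorageProvider(keyspace, "file_chunks"); // CF name is a placeholder

        // Write: the recipe splits the stream into chunks and uploads them,
        // retrying failed chunks for you.
        ChunkedStorage.newWriter(provider, "myfile.bin", new FileInputStream("/tmp/myfile.bin"))
            .withChunkSize(0x1000)   // 4 KB chunks
            .call();

        // Read: fetch the metadata, then reassemble the chunks in order.
        ObjectMetadata meta = ChunkedStorage.newInfoReader(provider, "myfile.bin").call();
        ByteArrayOutputStream out = new ByteArrayOutputStream(meta.getObjectSize().intValue());
        ChunkedStorage.newReader(provider, "myfile.bin", out)
            .withConcurrencyLevel(2) // read 2 chunks at a time
            .call();
    }
}
```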
