Cassandra aggregate query timeout

I am new to Cassandra and am running a user-defined aggregate on a 3-node Cassandra cluster on my local machine.
When I run this aggregate on a smaller data set, the result is fine and as expected.
But when the data set is large, the query fails with this error -
OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute_async'}, last_host=127.0.0.1
I found the questions below, which are similar to my issue, but they are unanswered:
How to set a timeout and throttling rate for a large user defined aggregate query
Cassandra CQLSH OperationTimedOut error=Client request timeout. See Session.execute[_async](timeout)
I have modified cassandra.yaml; the timeout settings are -
read_request_timeout_in_ms: 555000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 2000
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
But this did not help. What is the correct configuration for these timeouts so that the same query runs on a large data set without timing out?
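For reference, the OperationTimedOut above is raised client-side by the driver (recent cqlsh versions accept a --request-timeout option for exactly this), so the server-side yaml values are not the whole story. A minimal sketch of raising the client-side read timeout with the DataStax Java driver 3.x; the contact point and keyspace/table names are assumptions:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SocketOptions;

public class LongAggregateQuery {
    public static void main(String[] args) {
        // Raise the driver's per-request read timeout to match the server-side
        // read_request_timeout_in_ms above; the driver default is only 12 seconds.
        SocketOptions socketOptions = new SocketOptions().setReadTimeoutMillis(555000);
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")      // hypothetical contact point
                .withSocketOptions(socketOptions)
                .build();
             Session session = cluster.connect()) {
            // hypothetical keyspace/table, running the aggregate defined below
            session.execute("SELECT hostaggregate(host) FROM mykeyspace.sessions;");
        }
    }
}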
Aggregate code -
CREATE FUNCTION countSessions(datamap map<text,int>, host text)
RETURNS NULL ON NULL INPUT
RETURNS map<text, int>
LANGUAGE java AS
'
    Integer countValue = (Integer) datamap.get(host);
    if (countValue == null) {
        countValue = 1;
    } else {
        countValue++;
    }
    datamap.put(host, countValue);
    return datamap;
';
CREATE OR REPLACE AGGREGATE hostaggregate(text)
SFUNC countSessions
STYPE map<text, int>
INITCOND {};
Thanks and regards,
Vibhav
PS - If anybody chooses to down-vote this question, please mention the reason in the comments.

Related

Cassandra-driver Client.batch() gives RangeError

This code
const cassandra = require('cassandra-driver');
const Long = require('cassandra-driver').types.Long;
const client = new cassandra.Client({
  contactPoints: ['localhost:9042'],
  localDataCenter: 'datacenter1',
  keyspace: 'ks'
});
let q = [];
const ins_q = 'INSERT INTO ks.table1 (id, num1, num2, txt, date) VALUES (?,33,44,\'tes2\',toTimeStamp(now()));';
for (let i = 50000000003n; i < 50000100003n; i++) {
  q.push({ query: ins_q, params: [Long.fromString(i.toString(), true)] });
}
client.batch(q, { prepare: true }).catch(err => {
  console.log('Failed %s', err);
});
is causing this error
Failed RangeError [ERR_OUT_OF_RANGE]: The value of "value" is out of range. It must be >= 0 and <= 65535. Received 100000
at new NodeError (f:\node\lib\internal\errors.js:371:5)
at checkInt (f:\node\lib\internal\buffer.js:72:11)
at writeU_Int16BE (f:\node\lib\internal\buffer.js:832:3)
at Buffer.writeUInt16BE (f:\node\lib\internal\buffer.js:840:10)
at FrameWriter.writeShort (f:\node\test\node_modules\cassandra-driver\lib\writers.js:47:9)
at BatchRequest.write (f:\node\test\node_modules\cassandra-driver\lib\requests.js:438:17)
Is this a bug? I tried execute() with one bigint the same way and there was no problem.
"cassandra-driver": "^4.6.3"
Failed RangeError [ERR_OUT_OF_RANGE]: The value of "value" is out of range. It must be >= 0 and <= 65535. Received 100000
Is this a bug?
No, this is Cassandra protecting the cluster from running a large batch and crashing one or more nodes.
While you do appear to be running this on your own machine, Cassandra is first and foremost a distributed system. So it has certain guardrails built in to prevent non-distributed things from causing problems. This is one of them.
What happens here is that the driver looks at the id values and quickly determines that no single node is responsible for all of them. So it sends the batch of 100k statements to one node picked as the "coordinator." That coordinator then has to forward each write in the batch to the replicas that own the corresponding partition.
Or rather, it would try to, but it would probably time out before getting through even 1/5th of a batch this size. Remember, BATCH in Cassandra was built to run 5 or 6 write operations keeping 5 or 6 tables in sync, not 100k write operations to the same table.
The way to approach this scenario is to execute each write operation individually. If you want to optimize the process, make each write asynchronous, but cap the number of in-flight requests: run only a certain number of async writes at a time, block on their completion, and then run the next set. Repeat until complete.
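The answer describes the pattern in Java-driver vocabulary, so here is a sketch of it with the DataStax Java driver 3.x and Guava (the Node.js driver gives you the same shape with promises); MAX_IN_FLIGHT and the session wiring are assumptions:
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.concurrent.Semaphore;

public class ThrottledInserts {
    static final int MAX_IN_FLIGHT = 128; // assumption: tune for your cluster

    static void run(Session session) throws InterruptedException {
        PreparedStatement ps = session.prepare(
                "INSERT INTO ks.table1 (id, num1, num2, txt, date) " +
                "VALUES (?, 33, 44, 'tes2', toTimeStamp(now()))");
        Semaphore permits = new Semaphore(MAX_IN_FLIGHT);
        for (long id = 50000000003L; id < 50000100003L; id++) {
            permits.acquire(); // block while MAX_IN_FLIGHT writes are outstanding
            ResultSetFuture f = session.executeAsync(ps.bind(id));
            Futures.addCallback(f, new FutureCallback<ResultSet>() {
                public void onSuccess(ResultSet rs) { permits.release(); }
                public void onFailure(Throwable t) { permits.release(); } // log or retry here
            }, MoreExecutors.directExecutor());
        }
        permits.acquire(MAX_IN_FLIGHT); // drain: wait for the last writes to finish
    }
}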
In short, there are many nuances about Cassandra that are different from a relational database. The use and implementation of BATCH writes being one of them.
Why does it cause a range error?
Because of this part in the error message:
It must be >= 0 and <= 65535
The native protocol encodes the number of statements in a batch as an unsigned 16-bit integer (note the writeUInt16BE frames in the stack trace), so the Node.js driver cannot send a batch of more than 65535 statements. By the looks of it, it is being handed 100000.

Limit on number of records per page in Cassandra pagination

I am using CassandraPageRequest for fetching data based on page size.
Here is my code:
public CassandraPage<CustomerEntity> getCustomer(int limit, String pagingState) {
    final CassandraPageRequest cassandraPageRequest = createCassandraPageRequest(limit, pagingState);
    return getPageOfCustomer(cassandraPageRequest);
}

private CassandraPage<CustomerEntity> getPageOfCustomer(final CassandraPageRequest cassandraPageRequest) {
    final Slice<CustomerEntity> recordSlice = customerPaginationRepository.findAll(cassandraPageRequest);
    return new CassandraPage<>(recordSlice);
}

private CassandraPageRequest createCassandraPageRequest(final Integer limit, final String pagingState) {
    final PageRequest pageRequest = PageRequest.of(0, limit);
    final PagingState pageState = pagingState != null ? PagingState.fromString(pagingState) : null;
    return CassandraPageRequest.of(pageRequest, pageState);
}
This works fine. However, I want to know the recommendation on the number of records per page. When I give 1000 as the limit, it works fine. Can I give 10000 or more as the limit?
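For completeness, the way this is meant to be used is to thread the returned paging state through successive calls, roughly like this (a hypothetical sketch; the CassandraPage accessor names and service wiring are assumptions, not Spring Data API):
String state = null;
do {
    CassandraPage<CustomerEntity> page = customerService.getCustomer(1000, state);
    process(page.getContent());      // assumed accessor for the page's rows
    state = page.getPagingState();   // assumed accessor; null once all rows are read
} while (state != null);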
I work at ScyllaDB - Scylla is a Cassandra-compatible database.
I ran an experiment a few years back on the effect of page size and row size on Cassandra paging.
What I found is that the total amount of data that has to be returned, measured in bytes, is what really matters. If you have very large rows, even 1000 may be too much; if you have small rows, 10000 should be OK.
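One back-of-the-envelope way to turn that into a limit (a sketch; both numbers are assumptions you should measure for your own table):
int targetPageBytes = 1024 * 1024;                        // budget roughly 1 MB per page
int avgRowBytes = 200;                                    // measured average serialized row size
int limit = Math.max(100, targetPageBytes / avgRowBytes); // about 5000 rows per page here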
Other factors that should be considered are:
Amount of tombstones in your data - tombstones have to be read and skipped by a query searching for live data, so having many of them causes Cassandra (and Scylla) more work in the search for the next live row.
Type of query - are you doing a range scan over multiple partitions or over a single partition? A scan over multiple partitions may have a harder time filling a page (especially in the case of a lot of tombstones).
Timeout - by increasing the page size, Cassandra has to search for more rows; if the read timeout / range scan timeout values are low, the query may time out.
Please note that Scylla has removed the need for its users to optimize the page size - it caps your queries at 1 MB of data per page rather than a number of rows.
You can find the complete slide deck / session by searching for "Planning your queries for maximum performance". It's old but still holds (in Scylla we have more optimizations since :) ).

Cassandra Modelling for Date Range

Cassandra newbie here. Cassandra v3.9.
I'm modelling travellers' flight check-in data.
My main query criterion is a search for travellers within a date range (max of a 7-day window).
Here is what I've come up with, with my limited exposure to Cassandra:
create table IF NOT EXISTS travellers_checkin (
    checkinDay text,
    checkinTimestamp bigint,
    travellerName text,
    travellerPassportNo text,
    flightNumber text,
    "from" text,   -- from is a reserved word in CQL, so it must be quoted
    "to" text,
    bookingClass text,
    PRIMARY KEY (checkinDay, checkinTimestamp)
) WITH CLUSTERING ORDER BY (checkinTimestamp DESC)
Per day, I'm expecting up to a million records, so the partition will hold a million rows.
Now my users want a search in which the date window is mandatory (max a one-week window). In this case should I use an IN clause that spans multiple partitions? Is that the correct way, or should I think about re-modelling the data? Alternatively, I'm also wondering whether issuing 7 queries (one per day) and merging the responses would be efficient.
Your data model seems good, but if you could add one more field to the partition key (a bucket column, for example) it would scale better. And you should use a separate query per day with executeAsync, as in the example below.
If you use an IN clause, it means you're waiting on a single coordinator node to give you a response; it keeps all those queries and their responses in its heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing.
Source : https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
Instead of using an IN clause, issue a separate query for each day and execute it with executeAsync.
Java example:
PreparedStatement statement = session.prepare(
    "SELECT * FROM travellers_checkin WHERE checkinDay = ? AND checkinTimestamp >= ? AND checkinTimestamp <= ?");
List<ResultSetFuture> futures = new ArrayList<>();
// daysInWindow, windowStart and windowEnd are placeholders supplied by the caller
for (String day : daysInWindow) { // one query per day of the (max 7-day) window
    futures.add(session.executeAsync(statement.bind(day, windowStart, windowEnd)));
}
for (ResultSetFuture future : futures) {
    ResultSet rows = future.getUninterruptibly();
    // You get the result set of each query; merge them here
}

How to improve cassandra 3.0 read performance and throughput using async queries?

I have a table:
CREATE TABLE my_table (
    user_id text,
    ad_id text,
    date timestamp,
    PRIMARY KEY (user_id, ad_id)
);
The user_id and ad_id values that I use are no longer than 15 characters.
I query the table like this:
Set<String> users = ... // filled somewhere
Session session = ... // built somewhere
BoundStatement boundQuery = ... // built somewhere
// (using query: "SELECT * FROM my_table WHERE user_id=?")
List<Row> rowAds =
    users.stream()
        .map(user -> session.executeAsync(boundQuery.bind(user)))
        .map(ResultSetFuture::getUninterruptibly)
        .map(ResultSet::all)
        .flatMap(List::stream)
        .collect(toList());
The Set of users has approximately 3000 elements, and each user has approximately 300 ads.
This code is executed in 50 threads on the same machine (with different users, using the same Session object).
The algorithm takes between 2 and 3 seconds to complete.
The Cassandra cluster has 3 nodes, with a replication factor of 2. Each node has 6 cores and 12 GB of RAM.
The Cassandra nodes are at 60% of their CPU capacity and 33% of RAM (66% including page cache).
The querying machine is at 50% of its CPU capacity and 50% of RAM.
How do I improve the read time to less than 1 second?
Thanks!
UPDATE:
After some answers (thank you very much), I realized that I wasn't doing the queries in parallel, so I changed the code to:
List<Row> rowAds =
    users.stream()
        .map(user -> session.executeAsync(boundQuery.bind(user)))
        .collect(toList())
        .stream()
        .map(ResultSetFuture::getUninterruptibly)
        .map(ResultSet::all)
        .flatMap(List::stream)
        .collect(toList());
So now the queries are being done in parallel; this gave me times of approx 300 milliseconds, so great improvement there!
But my question continues, can it be faster?
Again, thanks!
users.stream()
    .map(user -> session.executeAsync(boundQuery.bind(user)))
    .map(ResultSetFuture::getUninterruptibly)
    .map(ResultSet::all)
    .flatMap(List::stream)
    .collect(toList());
A remark: on the 2nd map() you're calling ResultSetFuture::getUninterruptibly. It's a blocking call, so you don't benefit much from asynchronous execution ...
Instead, try to transform the list of futures returned by the driver (hint: ResultSetFuture implements Guava's ListenableFuture interface) into a future of a list.
See: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/util/concurrent/Futures.html#successfulAsList(java.lang.Iterable)
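A sketch of that transformation with Guava and the Java driver 3.x (checked-exception handling around get() is omitted for brevity):
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;

List<ResultSetFuture> futures = users.stream()
        .map(user -> session.executeAsync(boundQuery.bind(user)))
        .collect(toList());                  // all queries are now in flight
ListenableFuture<List<ResultSet>> all = Futures.successfulAsList(futures);
List<Row> rowAds = all.get().stream()        // one blocking call, on the slowest query
        .filter(Objects::nonNull)            // successfulAsList yields null for failed queries
        .map(ResultSet::all)
        .flatMap(List::stream)
        .collect(toList());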

Cassandra timeout during read query at consistency ONE

I have a problem with the Cassandra DB and hope somebody can help me. I have a table “log” into which I have inserted about 10000 rows. Everything works fine. I can do a
select * from log
select count(*) from log
As soon as I insert 100000 rows with TTL 50, I receive an error with
select count(*) from log
Version: Cassandra 2.1.8, 2 nodes
Cassandra timeout during read query at consistency ONE (1 responses
were required but only 0 replica responded)
Does somebody have an idea what I am doing wrong?
CREATE TABLE test.log (
    day text,
    date timestamp,
    ip text,
    iid int,
    request text,
    src text,
    tid int,
    txt text,
    PRIMARY KEY (day, date, ip)
) WITH read_repair_chance = 0.0
    AND dclocal_read_repair_chance = 0.1
    AND gc_grace_seconds = 864000
    AND bloom_filter_fp_chance = 0.01
    AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
    AND comment = ''
    AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
    AND compression = { 'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
    AND default_time_to_live = 0
    AND speculative_retry = '99.0PERCENTILE'
    AND min_index_interval = 128
    AND max_index_interval = 2048;
That error message indicates a problem with the READ operation. Most likely it is a read timeout. You may need to update your cassandra.yaml with a larger read timeout as described in this SO answer.
Example for 200 seconds:
read_request_timeout_in_ms: 200000
If updating that does not work, you may need to tweak the JVM settings for Cassandra. See DataStax's "Tuning Java Ops" for more information.
count() is a very costly operation: imagine Cassandra having to scan all the rows on all the nodes just to give you the count. With a small number of rows it works, but on bigger data you should use another approach to avoid timeouts.
First of all, we have to retrieve the rows piece by piece and count them ourselves, forgetting about count(*).
We should make several (dozens? hundreds?) queries, each filtering by partition and clustering key, and sum the numbers of rows retrieved by each query.
Here is a good explanation of what clustering and partition keys are. In your case day is the partition key, and the clustering key consists of two columns: date and ip.
It is most likely impossible to do this with the cqlsh command-line client, so you should write a script yourself. Official drivers for popular programming languages: http://docs.datastax.com/en/developer/driver-matrix/doc/common/driverMatrix.html
Example of one such query:
select day, date, ip, iid, request, src, tid, txt from test.log where day='Saturday' and date='2017-08-12 00:00:00' and ip='127.0.0.1'
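In driver terms, counting one partition by paging through it could look like this (a sketch with the DataStax Java driver 3.x, assuming an open Session; the fetch size is an assumption to tune):
import com.datastax.driver.core.Row;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

// Page through a single partition and count its rows client-side.
Statement stmt = new SimpleStatement(
        "SELECT day FROM test.log WHERE day = ?", "Saturday")
        .setFetchSize(1000);            // small pages keep each round trip under the timeout
long count = 0;
for (Row ignored : session.execute(stmt)) {
    count++;                            // the driver fetches further pages transparently
}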
Remarks:
If you just need to calculate a count and nothing more, it probably makes sense to look at a tool like https://github.com/brianmhess/cassandra-count
If Cassandra refuses to run your query without ALLOW FILTERING, that means the query is not efficient: https://stackoverflow.com/a/38350839/2900229
