Cassandra datastax driver ResultSet sharing in multiple threads for fast reading - cassandra

I have huge tables in Cassandra, with more than 2 billion rows and growing. The rows have a date field, and the table follows the date-bucket pattern to limit the size of each partition.
Even then, I have more than a million entries for a particular date.
I want to read and process the rows for each day as fast as possible. What I am doing is getting an instance of com.datastax.driver.core.ResultSet, obtaining an iterator from it, and sharing that iterator across multiple threads.
So, essentially I want to increase the read throughput. Is this the correct way? If not, please suggest a better way.

Unfortunately, you cannot do this as is. The reason is that a ResultSet keeps an internal paging state that is used to retrieve rows one page at a time.
You do have options however. Since I imagine you are doing range queries (queries across multiple partitions), you can use a strategy where you submit multiple queries across token ranges at a time using the token directive. A good example of this is documented in Paging through unordered partitioner results.
java-driver 2.0.10 and 2.1.5 each provide a mechanism for retrieving token ranges from Hosts and splitting them. There is an example of how to do this in the java-driver's integration tests in TokenRangeIntegrationTest.java#should_expose_token_ranges():
PreparedStatement rangeStmt = session.prepare("SELECT i FROM foo WHERE token(i) > ? and token(i) <= ?");
TokenRange foundRange = null;
for (TokenRange range : metadata.getTokenRanges()) {
    List<Row> rows = rangeQuery(rangeStmt, range);
    for (Row row : rows) {
        if (row.getInt("i") == testKey) {
            // We should find our test key exactly once
            assertThat(foundRange)
                .describedAs("found the same key in two ranges: " + foundRange + " and " + range)
                .isNull();
            foundRange = range;
            // That range should be managed by the replica
            assertThat(metadata.getReplicas("test", range)).contains(replica);
        }
    }
}
assertThat(foundRange).isNotNull();
}
...
private List<Row> rangeQuery(PreparedStatement rangeStmt, TokenRange range) {
    List<Row> rows = Lists.newArrayList();
    for (TokenRange subRange : range.unwrap()) {
        Statement statement = rangeStmt.bind(subRange.getStart(), subRange.getEnd());
        rows.addAll(session.execute(statement).all());
    }
    return rows;
}
You could generate your statements and submit them asynchronously; the example above just iterates through the statements one at a time.
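As a rough sketch of the fan-out idea, the helper below splits the full token range into contiguous (start, end] slices and runs one fetch per slice on a thread pool. It assumes Murmur3Partitioner (64-bit signed tokens); the class TokenRangeFanOut and the fetchRange callback are hypothetical names, not part of the java-driver. In a real application the callback would bind a statement such as the rangeStmt above and return its rows, and the ranges would preferably come from metadata.getTokenRanges() rather than an even split.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.BiFunction;

/** Hypothetical helper for illustration; not part of the java-driver. */
final class TokenRangeFanOut {

    /** Splits (Long.MIN_VALUE, Long.MAX_VALUE] into n contiguous (start, end] slices. */
    static List<long[]> split(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        // BigInteger avoids overflow: the full width does not fit in a long
        BigInteger width = BigInteger.valueOf(Long.MAX_VALUE).subtract(min);
        List<long[]> ranges = new ArrayList<>();
        long start = Long.MIN_VALUE;
        for (int i = 1; i <= n; i++) {
            long end = (i == n)
                ? Long.MAX_VALUE
                : min.add(width.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n)))
                     .longValueExact();
            ranges.add(new long[] {start, end});
            start = end; // the next slice begins where this one ends
        }
        return ranges;
    }

    /** Runs fetchRange(start, end) for every slice on the pool and concatenates the results. */
    static <T> List<T> fetchAll(List<long[]> ranges, ExecutorService pool,
                                BiFunction<Long, Long, List<T>> fetchRange) {
        List<CompletableFuture<List<T>>> futures = new ArrayList<>();
        for (long[] r : ranges) {
            futures.add(CompletableFuture.supplyAsync(() -> fetchRange.apply(r[0], r[1]), pool));
        }
        List<T> all = new ArrayList<>();
        for (CompletableFuture<List<T>> f : futures) {
            all.addAll(f.join()); // waits for each slice; results stay in slice order
        }
        return all;
    }
}
```

Collecting everything into one list is only reasonable for small results; for billions of rows each slice would more likely be processed inside its own task.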
Another option is to use the spark-cassandra-connector, which essentially does this under the covers and in a very efficient way. I find it very easy to use and you don't even need to set up a spark cluster to use it. See this document for how to use the Java API.

Related

Postgresql - IN clause optimization for more than 3000 values

I have an application where the user uploads an Excel file (.xlsx or .csv) with more than 10,000 rows and a single column "partId" containing the values to look up in the database.
I read the Excel values, store them in a list, and pass the list as a parameter to the Spring Boot JPA repository find method, which builds an IN clause query internally:
// Read excel file
stream = new ByteArrayInputStream(file.getBytes());
wb = WorkbookFactory.create(stream);
org.apache.poi.ss.usermodel.Sheet sheet = wb.getSheetAt(wb.getActiveSheetIndex());
Iterator<Row> rowIterator = sheet.rowIterator();
while(rowIterator.hasNext()) {
Row row = rowIterator.next();
Cell cell = row.getCell(0);
System.out.println(cell.getStringCellValue());
vinList.add(cell.getStringCellValue());
}
//JPA repository method that I used
findByPartIdInAndSecondaryId(List<String> partIds);
I have read in many articles, and experienced in the case above, that an IN query is inefficient for a huge list of values.
How can I optimize the above scenario or write a new optimized query?
Also, please let me know if there is a more efficient way of reading an Excel file than the code snippet above.
It would be much helpful!! Thanks in advance!
If the list is truly huge, you will never be lightning fast.
I see several options:
1. Send a query with a large IN list, as you mention in your question.
2. Construct a statement that is a join with a large VALUES clause:
SELECT ... FROM mytable
JOIN (VALUES (42), (101), (43), ...) AS tmp(col)
ON mytable.id = tmp.col;
3. Create a temporary table with the values and join with that:
BEGIN;
CREATE TEMP TABLE tmp(col bigint) ON COMMIT DROP;
Then either
COPY tmp FROM STDIN; -- if Spring supports COPY
or
INSERT INTO tmp VALUES (42), (101), (43), ...; -- if not
Then
ANALYZE tmp; -- for good statistics
SELECT ... FROM mytable
JOIN tmp ON mytable.id = tmp.col;
COMMIT; -- drops the temporary table
Which of these is fastest is best determined by trial and error for your case; I don't think that it can be said that one of the methods will always beat the others.
Some considerations:
Solutions 1 and 2 may result in very large statements, while solution 3 can be split into smaller chunks.
Solution 3 will very likely be slower unless the list is truly large.
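To illustrate the VALUES-join option together with the idea of splitting the work into chunks, here is a minimal sketch that cuts the uploaded list into fixed-size pieces and builds one parameterized statement per piece. The class ValuesJoinSql, its method names, and the mytable.* column list are assumptions made for the example; only the table and column names come from the answer above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration only. */
final class ValuesJoinSql {

    /** Splits values into consecutive chunks of at most size elements. */
    static <T> List<List<T>> chunk(List<T> values, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be >= 1");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < values.size(); i += size) {
            int end = Math.min(i + size, values.size());
            chunks.add(new ArrayList<>(values.subList(i, end)));
        }
        return chunks;
    }

    /** Builds a join against a VALUES list with n bind placeholders. */
    static String valuesJoin(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        String placeholders = String.join(", ", Collections.nCopies(n, "(?)"));
        return "SELECT mytable.* FROM mytable JOIN (VALUES " + placeholders
            + ") AS tmp(col) ON mytable.id = tmp.col";
    }
}
```

Each chunk would then be executed as its own statement, with the chunk's values bound to the placeholders in order, and the partial results merged in the application. The best chunk size is something to measure rather than assume.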

Cassandra Modelling for Date Range

Cassandra Newbie here. Cassandra v 3.9.
I'm modelling travellers' flight check-in data.
My main query is a search for travellers within a date range (a window of at most 7 days).
Here is what I've come up with, given my limited exposure to Cassandra.
create table IF NOT EXISTS travellers_checkin (checkinDay text, checkinTimestamp bigint, travellerName text, travellerPassportNo text, flightNumber text, from text, to text, bookingClass text, PRIMARY KEY (checkinDay, checkinTimestamp)) WITH CLUSTERING ORDER BY (checkinTimestamp DESC)
Per day, I'm expecting up to a million records, so each partition will hold up to a million records.
Now my users want a search in which the date window is mandatory (at most one week). In this case, should I use an IN clause that spans multiple partitions? Is this the correct way, or should I think of re-modelling the data? Alternatively, I'm also wondering whether issuing 7 queries (one per day) and merging the responses would be efficient.
Your data model seems good, but it will scale better if you can add another field to the partition key. You should also use a separate query per day and run the queries with executeAsync.
If you are using an IN clause, this means that you're waiting on a single coordinator node to give you a response; it's keeping all those queries and their responses in the heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing.
Source: https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
Instead of using an IN clause, use a separate query for each day and execute it with executeAsync.
Java example:
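PreparedStatement statement = session.prepare("SELECT * FROM travellers_checkin WHERE checkinDay = ? AND checkinTimestamp >= ? AND checkinTimestamp <= ?");
List<ResultSetFuture> futures = new ArrayList<>();
// days, fromTimestamp and toTimestamp are placeholders for the requested window:
// one text bucket key per day (List<String>) and the bigint bounds (long)
for (String day : days) {
    // The statement has three bind markers, so three values must be bound
    ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(day, fromTimestamp, toTimestamp));
    futures.add(resultSetFuture);
}
for (ResultSetFuture future : futures) {
    ResultSet rows = future.getUninterruptibly();
    // You get the result set of each query; merge them here
}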
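One way to produce the per-day partition keys for such a fan-out is sketched below. It assumes checkinDay is stored as an ISO date string such as 2017-01-30, which the question does not state; the class DayBuckets and its 7-day limit constant are hypothetical names chosen for the example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only. */
final class DayBuckets {

    /** Largest window the search accepts, per the requirement in the question. */
    static final int MAX_DAYS = 7;

    /** Returns one bucket key per day from 'from' to 'to', both inclusive. */
    static List<String> between(LocalDate from, LocalDate to) {
        if (to.isBefore(from)) {
            throw new IllegalArgumentException("'to' is before 'from'");
        }
        long days = ChronoUnit.DAYS.between(from, to) + 1;
        if (days > MAX_DAYS) {
            throw new IllegalArgumentException("window exceeds " + MAX_DAYS + " days");
        }
        List<String> keys = new ArrayList<>();
        for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
            // Assumed key format: yyyy-MM-dd
            keys.add(d.format(DateTimeFormatter.ISO_LOCAL_DATE));
        }
        return keys;
    }
}
```

Each returned key would be bound to the checkinDay marker of one asynchronous query, so a 7-day search issues at most seven single-partition reads.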

Insert is 10 times faster than Update in Cassandra. Is it normal?

In my Java application accessing Cassandra, I can insert 500 rows per second, but update only 50 rows per second (the updated rows didn't actually exist).
Updating one hundred fields is as fast as updating one field.
I just use CQL statements in the Java application.
Is this situation normal? How can I improve my application?
public void InsertSome(List<Data> data) {
String insertQuery = "INSERT INTO Data (E,D,A,S,C,......) values(?,?,?,?,?,.............); ";
if (prepared == null)
prepared = getSession().prepare(insertQuery);
count += data.size();
for (int i = 0; i < data.size(); i++) {
List<Object> objs = getFiledValues(data.get(i));
BoundStatement bs = prepared.bind(objs.toArray());
getSession().execute(bs);
}
}
public void UpdateOneField(Data data) {
String updateQuery = "UPDATE Data set C=? where E=? and D=? and A=? and S=?; ";
if (prepared == null)
prepared = getSession().prepare(updateQuery);
BoundStatement bs = prepared.bind(data.getC(), data.getE(),
data.getD(), data.getA(), data.getS());
getSession().execute(bs);
}
public void UpdateOne(Data data) {
String updateQuery = "UPDATE Data set C=?,U=?,F........where E=? and D=? and A=? and S=? and D=?; ";
if (prepared == null)
prepared = getSession().prepare(updateQuery);
......
BoundStatement bs = prepared.bind(objs2.toArray());
getSession().execute(bs);
}
Schema:
Create Table Data (
E,
D,
A,
S,
D,
C,
U,
S,
...
PRIMARY KEY ((E
D),
A,
S)
) WITH compression = { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb' : 64 }
AND compaction = { 'class' : 'LeveledCompactionStrategy' };
Another scenario:
I used the same application to access another Cassandra cluster. The result was different: UPDATE was as fast as INSERT, but it could only INSERT/UPDATE 5 rows per second. This Cassandra cluster is DataStax Enterprise running on GCE (I used the default DataStax Enterprise on Google Cloud Launcher).
So I think some configuration setting is probably the cause, but I don't know which one.
Conceptually UPDATE and INSERT are the same so I would expect similar performance. UPDATE doesn't check to see if the data already exists (unless you are doing a lightweight transaction with IF EXISTS).
I noticed that each of your methods prepares a statement if the cached one is null. Is it possible the statement is being re-prepared each time? That would add a round trip for every method invocation. I also noticed that InsertSome does multiple inserts per invocation, whereas UpdateOne / UpdateOneField execute one statement. So if the statement were prepared every time, that's one prepare per update, but only one per list of inserts.
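A simple way to rule out re-preparing is to cache one prepared statement per query string. The sketch below is driver-agnostic: PreparedCache is a hypothetical class, and the prepare function passed to it is assumed to wrap a call such as session.prepare(cql).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache for illustration: prepares each distinct query string at most once. */
final class PreparedCache<S> {

    private final ConcurrentHashMap<String, S> cache = new ConcurrentHashMap<>();
    private final Function<String, S> prepare;

    /** prepare is the (assumed) expensive call, e.g. a wrapper around session.prepare. */
    PreparedCache(Function<String, S> prepare) {
        this.prepare = prepare;
    }

    /** Returns the cached statement for cql, preparing it on first use only. */
    S get(String cql) {
        return cache.computeIfAbsent(cql, prepare);
    }
}
```

With the DataStax driver this could be created as new PreparedCache<PreparedStatement>(session::prepare), and each method would call get(insertQuery) or get(updateQuery) instead of keeping its own null check. Keying by query string also keeps the insert and update statements from overwriting each other.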
Cassandra uses log-structured merge trees for an on-disk format, meaning all writes are done sequentially (the database is the append-only log). That implies a lower write latency.
At the cluster level, Cassandra is also able to achieve greater write scalability by partitioning the key space such that each machine is only responsible for a portion of the keys. That implies a higher write throughput, as more writes can be done in parallel.

Cassandra DataStax driver: how to page through columns

I have wide rows with timestamp columns. If I use the DataStax Java driver, I can page row results by using LIMIT or FETCH_SIZE, however, I could not find any specifics as to how I can page through columns for a specific row.
I found this post: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/CQL-3-and-wide-rows-td7594577.html
which explains how I could get ranges of columns based on the column name (timestamp) values.
However, what I need to do is get ALL columns. I just don't want to load them all into memory, but rather "stream" the results and process a chunk of columns (preferably of a controllable size) at a time until all columns of the row are processed.
Does the DataStax driver support streaming of this kind? And if so, what is the syntax for using it?
Additional clarification:
Essentially, what I'm looking for is an equivalent of Hector's ColumnSliceIterator, with which I could iterate over all columns (up to Integer.MAX_VALUE of them) of a specific row in batches of, say, 100 columns at a time, as follows:
SliceQuery sliceQuery = HFactory.createSliceQuery(keySpace, ...);
sliceQuery.setColumnFamily(MY_COLUMN_FAMILY);
sliceQuery.setKey(myRowKey);
// columns to be returned. The null value indicates all columns
sliceQuery.setRange(
null // start column
, null // end column
, false // reversed order
, Integer.MAX_VALUE // number of columns to return
);
ColumnSliceIterator iter = new ColumnSliceIterator(
sliceQuery // previously created slice query needs to be passed as parameter
, null // starting column name
, null // ending column name
, false // reverse
, 100 // column count <-- the batch size
);
while (iter.hasNext()) {
String myColumnValue = iter.next().getValue();
}
How do I do the exact same thing using the DataStax driver?
thanks!
Marina
The ResultSet object that you get is actually set up to do this sort of pagination for you by default. Calling one() repeatedly or iterating with iterator() gives you access to all the data without loading it all into memory at once. More details are available in the API documentation.
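If the processing itself should happen in fixed-size chunks, as with Hector's batch size of 100, a small helper over any Iterator is enough. The sketch below is driver-agnostic and the class Batches is a hypothetical name; with the DataStax driver the iterator would come from resultSet.iterator(), and the page size the driver fetches per round trip can be tuned separately with Statement.setFetchSize.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper for illustration only. */
final class Batches {

    /** Feeds the iterator's elements to handler in lists of at most batchSize items. */
    static <T> void forEachBatch(Iterator<T> rows, int batchSize, Consumer<List<T>> handler) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<T> batch = new ArrayList<>(batchSize);
        while (rows.hasNext()) {
            batch.add(rows.next());
            if (batch.size() == batchSize) {
                handler.accept(batch);
                batch = new ArrayList<>(batchSize); // start a fresh batch
            }
        }
        if (!batch.isEmpty()) {
            handler.accept(batch); // final, possibly shorter batch
        }
    }
}
```

Only one batch is held in memory at a time, in addition to whatever page the underlying iterator has buffered.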

How to update multiple rows using Hector

Is there a way to update multiple rows in a Cassandra database using a column family template, for example by supplying a list of keys?
Currently I am using the columnFamilyTemplate updater to loop through a list of keys and do an update for each row. I have seen queries like multigetSliceQuery, but I don't know their equivalent for updates.
There is no utility method in ColumnFamilyTemplate that allows you to pass a list of keys with a list of mutations in one call.
You can implement your own using mutators.
This is the basic code for doing it in Hector:
Set<String> keys = MY_KEYS;
Map<String, String> pairsOfNameValues = MY_MUTATION_BY_NAME_AND_VALUE;
// Build one HColumn per name/value pair
Set<HColumn<String, String>> columns = new HashSet<HColumn<String, String>>();
for (Entry<String, String> pair : pairsOfNameValues.entrySet()) {
    columns.add(HFactory.createStringColumn(pair.getKey(), pair.getValue()));
}
Mutator<String> mutator = template.createMutator();
String column_family_name = template.getColumnFamily();
// Queue an insertion for every key/column pair; nothing is sent until execute()
for (String key : keys) {
    for (HColumn<String, String> column : columns) {
        mutator.addInsertion(key, column_family_name, column);
    }
}
mutator.execute();
Well it should look like that. This is an example for insertion, be sure to use the following methods for batch mutations:
mutator.addInsertion
mutator.addDeletion
mutator.addCounter
mutator.addCounterDeletion
since these will execute right away without waiting for mutator.execute():
mutator.incrementCounter
mutator.deleteCounter
mutator.insert
mutator.delete
As a last note: A mutator allows you to batch mutations on multiple rows on multiple column families at once ... which is why I generally prefer to use them instead of CF templates. I have a lot of denormalization for functionalities that use the "push-on-write" pattern of NoSQL.
You can use a batch mutation to insert as much as you want (within thrift_max_message_length_in_mb). See http://hector-client.github.com/hector//source/content/API/core/1.0-1/me/prettyprint/cassandra/model/MutatorImpl.html.

Resources