Understanding Cassandra, using C/C++ Driver

I have an application written in C++ that uses the DataStax C++ driver to communicate with Cassandra.
I run 20 million inserts and then use 50 queries to read those 20 million rows back. I have limited my partition key to 50 possible values, so there are at most 50 partitions, and each query returns around 300,000 - 400,000 rows.
I am keeping track of the wall-clock time for different parts of this application. The following piece of code, which executes the query and gets the result, takes on average 3 seconds to complete, which seems reasonable to me:
stopWatch.start();
CassFuture* result_future = cass_session_execute(session, statement);
if (cass_future_error_code(result_future) == CASS_OK) {
    const CassResult* result = cass_future_get_result(result_future);
}
stopWatch.stop();
However, the following piece of code, which iterates through the rows, takes around 30 seconds on average!
resWatch.start();
CassIterator* rows = cass_iterator_from_result(result);
while (cass_iterator_next(rows)) {
    const CassRow* row = cass_iterator_get_row(rows);
    BAEL_LOG_INFO << "got a row " << BAEL_LOG_END;
}
resWatch.stop();
I realize that the CassIterator could be iterating over some 400,000 rows, but is 30 seconds a reasonable time for that?
Or is there something I'm missing about the way Cassandra functions: do cass_session_execute() and cass_future_get_result() not fetch all the rows relevant to the executed query and return them to the client? Or do they fetch lazily?


Spark: problem with crossJoin (takes a tremendous amount of time)

First of all, I have to say that I've already tried everything I know or have found on Google (including this Spark: How to use crossJoin, which is exactly my problem).
I have to calculate the Cartesian product of two DataFrames, countries and units, built as follows:
A.cache().count()
val units = A.groupBy("country")
  .agg(sum("grade").as("grade"),
       sum("point").as("point"))
  .withColumn("AVR", $"grade" / $"point" * 1000)
  .drop("point", "grade")
val countries = D.select("country").distinct()
val C = countries.crossJoin(units)
countries contains country names and is bounded by 150 rows. units is a DataFrame with 3 rows, the aggregated result of another DataFrame. I checked the result sizes many times and those are indeed the sizes, yet it takes 5 hours to complete.
I know I missed something. I've tried caching, repartitioning, etc.
I would love to get some other ideas.
I have two suggestions for you:
Look at the explain plan and the Spark properties; for the amount of data you have mentioned, 5 hours is a really long time. My expectation is that you have far too many shuffles; you can look at properties like spark.sql.shuffle.partitions.
Instead of doing a cross join, you can maybe do a collect and explore broadcasts:
https://sparkbyexamples.com/spark/spark-broadcast-variables/ but do this only on small amounts of data, as this data is brought back to the driver.
What is the action you are doing afterwards with C?
Also, if these datasets are so small, consider collecting them to the driver, doing the manipulation there, and calling spark.createDataFrame later if needed.
Update #1:
final case class Unit(country: String, AVR: Double)
val collectedUnits: Seq[Unit] = units.as[Unit].collect
val collectedCountries: Seq[String] = countries.as[String].collect
val pairs: Seq[(String, Unit)] = for {
  unit <- collectedUnits
  country <- collectedCountries
} yield (country, unit)
I've finally understood the problem: Spark used an excessive number of shuffle partitions, and thus the shuffle takes a long time.
The way to solve it is to change the default:
sparkSession.conf.set("spark.sql.shuffle.partitions", 10)
And it works like magic.
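For perspective on why the default of 200 shuffle partitions hurts here, a quick back-of-the-envelope check (my sketch; the sizes are taken from the question, and 200 is Spark's documented default for spark.sql.shuffle.partitions):

```java
public class CrossJoinSize {
    // Sizes from the question above.
    static final int COUNTRIES = 150;
    static final int UNITS = 3;
    static final int DEFAULT_SHUFFLE_PARTITIONS = 200;

    // The full Cartesian product is tiny.
    static int crossJoinRows() {
        return COUNTRIES * UNITS;
    }

    public static void main(String[] args) {
        // 450 result rows spread across 200 shuffle partitions means most
        // tasks carry one or two rows: scheduling overhead dominates the work,
        // which is why lowering the partition count helps so dramatically.
        System.out.println(crossJoinRows()); // 450
    }
}
```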

Limit on number of records per page in Cassandra pagination

I am using CassandraPageRequest for fetching data based on page size.
Here is my code:
public CassandraPage<CustomerEntity> getCustomer(int limit, String pagingState) {
    final CassandraPageRequest cassandraPageRequest = createCassandraPageRequest(limit, pagingState);
    return getPageOfCustomer(cassandraPageRequest);
}

private CassandraPage<CustomerEntity> getPageOfCustomer(final CassandraPageRequest cassandraPageRequest) {
    final Slice<CustomerEntity> recordSlice = customerPaginationRepository.findAll(cassandraPageRequest);
    return new CassandraPage<>(recordSlice);
}

private CassandraPageRequest createCassandraPageRequest(final Integer limit, final String pagingState) {
    final PageRequest pageRequest = PageRequest.of(0, limit);
    final PagingState pageState = pagingState != null ? PagingState.fromString(pagingState) : null;
    return CassandraPageRequest.of(pageRequest, pageState);
}
This works fine. However, I want to know the recommendation for the number of records per page. A limit of 1000 works fine; can I go to 10000 or beyond?
I work at ScyllaDB; Scylla is a Cassandra-compatible database.
I ran an experiment a few years back on the effect of page size and row size on Cassandra paging.
What I found is that the total amount of information that needs to be returned, in bytes, is what really matters. If you have very large rows, even 1000 may be too much; if you have small rows, 10000 should be OK.
Other factors that should be considered are:
Amount of tombstones in your data: tombstones have to be read and skipped in a query searching for live data, so having many of them causes Cassandra (and Scylla) more work in the search for the next live row.
Type of query: are you doing a range scan over multiple partitions or reading a single partition? A scan over multiple partitions may find it harder to fill a page (especially in the case of a lot of tombstones).
Timeout: by increasing the page size, Cassandra will have to search for more rows per page; if the read timeout / range scan timeout values are low, the query may time out.
Please note that Scylla has removed the need for its users to optimize the page size: it caps your queries at 1MB of data per page.
You can find the complete slide deck / session by searching for "Planning your queries for maximum performance"; it's old but still holds (in Scylla we have more optimizations since :) ).
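A rough sketch of that rule of thumb (my heuristic, not an official driver formula; the ~1MB byte budget mirrors the per-page cap mentioned above, and the helper name is illustrative):

```java
public class PageSizeEstimate {
    // Pick the page size from a byte budget rather than a fixed row count:
    // the total bytes per page is what actually matters.
    static int estimatePageSize(int avgRowBytes, int targetPageBytes) {
        return Math.max(1, targetPageBytes / avgRowBytes);
    }

    public static void main(String[] args) {
        // Small rows (~100 B): 10000 rows per page stays near 1 MB.
        System.out.println(estimatePageSize(100, 1_000_000));    // 10000
        // Large rows (~10 KB): even 1000 would be too much; use ~100.
        System.out.println(estimatePageSize(10_000, 1_000_000)); // 100
    }
}
```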

deduplicating ArangoDB document collection

I'm sure there is an easy and fast way to do this but it's escaping me. I have a large dataset that has some duplicate records, and I want to get rid of the duplicates. (the duplicates are uniquely identified by one property, but the rest of the document should be identical as well).
I've attempted to create a new collection that only has unique values in a few different ways, but they are all quite slow. For example:
FOR doc IN Documents
  COLLECT docId = doc.myId, doc2 = doc
  INSERT doc2 IN Documents2
or
FOR doc IN Documents
  LET existing = (FOR doc2 IN Documents2
    FILTER doc.myId == doc2.myId
    RETURN doc2)
  UPDATE existing WITH doc IN Documents2
or (this gives me a "violated unique constraint" error)
FOR doc IN Documents
  UPSERT {myId: doc.myId}
  INSERT doc
  UPDATE doc IN Documents2
TL;DR
It does not take that long to de-duplicate the records and write them to another collection (less than 60 seconds), at least on my desktop machine (Windows 10, Intel 6700K 4x4.0GHz, 32GB RAM, Evo 850 SSD).
Certain queries require proper indexing however, or they will last forever. Indexes require some memory, but compared to the needed memory during query execution for grouping the records, it is negligible. If you're short of memory, performance will suffer because the operating system needs to swap data between memory and mass storage. This is especially a problem with spinning disks, not so much with fast flash storage devices.
Preparation
I generated 2.2 million records with 5-20 random attributes and 160 chars of gibberish per attribute. In addition, every record has an attribute myid. 1.87m records have a unique id, 60k myids exist twice, and 70k three times. The collection size was reported as 4.83GB:
// run three times with the ranges below to produce the duplicate myids:
// 1..2000000: 300s
// 1..130000: 20s
// 1..70000: 10s
FOR i IN 1..2000000
  LET randomAttributes = MERGE(
    FOR j IN 1..FLOOR(RAND() * 15) + 5
      RETURN { [CONCAT("attr", j)]: RANDOM_TOKEN(160) }
  )
  INSERT MERGE(randomAttributes, {myid: i}) INTO test1
Memory consumption before starting ArangoDB was at 3.4GB, after starting 4.0GB, and around 8.8GB after loading the test1 source collection.
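As a back-of-the-envelope check (my arithmetic, not part of the original benchmark), the three insert ranges shown in the comments above account exactly for the 2.2m total records and the 2m distinct myids:

```java
public class DataDistribution {
    // ids 1..70000 are covered by all three runs, ids 70001..130000 by two
    // of them, and ids 130001..2000000 by the first run only.
    static final int THRICE = 70_000;
    static final int TWICE = 130_000 - 70_000;     // 60k
    static final int ONCE = 2_000_000 - 130_000;   // 1.87m

    static int distinctIds()   { return ONCE + TWICE + THRICE; }
    static long totalRecords() { return ONCE + 2L * TWICE + 3L * THRICE; }

    public static void main(String[] args) {
        System.out.println(distinctIds());   // 2000000 distinct myids
        System.out.println(totalRecords());  // 2200000 records in test1
    }
}
```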
Baseline
Reading from test1 and inserting all documents (2.2m) into test2 took 20s on my system, with a memory peak of ~17.6GB:
FOR doc IN test1
  INSERT doc INTO test2
Grouping by myid without writing took approx. 9s for me, with 9GB RAM peak during query:
LET result = (
  FOR doc IN test1
    COLLECT myid = doc.myid
    RETURN 1
)
RETURN LENGTH(result)
Failed grouping
I tried your COLLECT docId = doc.myId, doc2 = doc approach on a dataset with just 3 records and one duplicate myid. It showed that the query does not actually group/remove duplicates. I therefore tried to find alternative queries.
Grouping with INTO
To group duplicate myids together but retain the possibility to access the full documents, COLLECT ... INTO can be used. I simply picked the first document of every group to remove redundant myids. The query took about 40s for writing the 2m records with unique myid attribute to test2. I didn't measure memory consumption accurately, but I saw different memory peaks spanning 14GB to 21GB. Maybe truncating the test collections and re-running the queries increases the required memory because of some stale entries that get in the way somehow (compaction / key generation)?
FOR doc IN test1
  COLLECT myid = doc.myid INTO groups
  INSERT groups[0].doc INTO test2
Grouping with subquery
The following query showed a more stable memory consumption, peaking at 13.4GB:
FOR doc IN test1
  COLLECT myid = doc.myid
  LET doc2 = (
    FOR doc3 IN test1
      FILTER doc3.myid == myid
      LIMIT 1
      RETURN doc3
  )
  INSERT doc2[0] INTO test2
Note however that it required a hash index on myid in test1 to achieve a query execution time of ~38s. Otherwise the subquery will cause millions of collection scans and take ages.
Grouping with INTO and KEEP
Instead of storing the whole documents that fell into a group, we can assign just the _id to a variable and KEEP it so that we can look up the document bodies using DOCUMENT():
FOR doc IN test1
  LET d = doc._id
  COLLECT myid = doc.myid INTO groups KEEP d
  INSERT DOCUMENT(groups[0].d) INTO test2
Memory usage: 8.1GB after loading the source collection, 13.5GB peak during the query. It only took 30 seconds for the 2m records!
Grouping with INTO and projection
Instead of KEEP I also tried a projection out of curiosity:
FOR doc IN test1
  COLLECT myid = doc.myid INTO groups = doc._id
  INSERT DOCUMENT(groups[0]) INTO test2
RAM was at 8.3GB after loading test1, and the peak at 17.8GB (there were actually two heavy spikes during the query execution, both going over 17GB). It took 35s to complete for the 2m records.
Upsert
I tried something with UPSERT, but saw some strange results. It turned out to be an oversight in ArangoDB's upsert implementation. v3.0.2 contains a fix and I get correct results now:
FOR doc IN test1
  UPSERT {myid: doc.myid}
  INSERT doc
  UPDATE {} IN test2
It took 40s to process with a (unique) hash index on myid in test2, with a RAM peak around 13.2GB.
Delete duplicates in-place
I first copied all documents from test1 to test2 (2.2m records), then I tried to REMOVE just the duplicates in test2:
FOR doc IN test2
  COLLECT myid = doc.myid INTO keys = doc._key
  LET allButFirst = SLICE(keys, 1) // or SHIFT(keys)
  FOR k IN allButFirst
    REMOVE k IN test2
Memory was at 8.2GB (with only test2 loaded) and went up to 13.5GB during the query. It took roughly 16 seconds to delete the duplicates (200k).
Verification
The following query groups by myid and counts how often every id occurs. Run against the target collection test2, the result should be {"1": 2000000}; otherwise there are still duplicates. I double-checked the query results above and everything checked out.
FOR doc IN test2
  COLLECT myid = doc.myid WITH COUNT INTO count
  COLLECT c = count WITH COUNT INTO cc
  RETURN {[c]: cc}
Conclusion
The performance appears to be reasonable with ArangoDB v3.0, although it may degrade if not enough RAM is available. The different queries completed in roughly the same time, but showed different RAM usage characteristics. For certain queries, indexes are necessary to avoid high computational complexity (here: full collection scans, which could mean trillions of document reads in the worst case).
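The worst case can be estimated roughly (my own estimate, not a measured number), assuming the no-index subquery variant performs one full collection scan per distinct myid group:

```java
public class WorstCaseReads {
    // Without an index on myid, each subquery scans the whole collection.
    static long worstCaseReads(long groups, long docsPerScan) {
        return groups * docsPerScan;
    }

    public static void main(String[] args) {
        // ~2m distinct myids, ~2.2m documents scanned per subquery.
        System.out.println(worstCaseReads(2_000_000L, 2_200_000L)); // 4400000000000
    }
}
```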
Can you try my presented solutions on your data and check what the performance is on your machine?

How to improve cassandra 3.0 read performance and throughput using async queries?

I have a table:
CREATE TABLE my_table (
    user_id text,
    ad_id text,
    date timestamp,
    PRIMARY KEY (user_id, ad_id)
);
The lengths of the user_id and ad_id that I use are not longer than 15 characters.
I query the table like this:
Set<String> users = ... filled somewhere
Session session = ... built somewhere
BoundStatement boundQuery = ... built somewhere
(using query: "SELECT * FROM my_table WHERE user_id=?")
List<Row> rowAds =
    users.stream()
         .map(user -> session.executeAsync(boundQuery.bind(user)))
         .map(ResultSetFuture::getUninterruptibly)
         .map(ResultSet::all)
         .flatMap(List::stream)
         .collect(toList());
The Set of users has approximately 3000 elements, and each user has approximately 300 ads.
This code is executed in 50 threads on the same machine (with different users, using the same Session object).
The algorithm takes between 2 and 3 seconds to complete.
The Cassandra cluster has 3 nodes, with a replication factor of 2. Each node has 6 cores and 12 GB of ram.
The Cassandra nodes are in 60% of their CPU capacity, 33% of ram, 66% of ram (including page cache)
The querying machine is 50% of it's cpu capacity, 50% of ram
How do I improve the read time to less than 1 second?
Thanks!
UPDATE:
After some answers (thank you very much), I realized that I wasn't doing the queries in parallel, so I changed the code to:
List<Row> rowAds =
    users.stream()
         .map(user -> session.executeAsync(boundQuery.bind(user)))
         .collect(toList())
         .stream()
         .map(ResultSetFuture::getUninterruptibly)
         .map(ResultSet::all)
         .flatMap(List::stream)
         .collect(toList());
So now the queries are being done in parallel; this gave me times of approximately 300 milliseconds, so a great improvement there!
But my question continues, can it be faster?
Again, thanks!
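Some back-of-the-envelope arithmetic on the numbers in this question (my sketch, not from any answer): the workload is roughly 900,000 rows per pass, so 300 ms already corresponds to about 3 million rows per second.

```java
public class Throughput {
    // Numbers from the question: ~3000 users x ~300 ads each, fetched in
    // roughly 300 ms after the async fix.
    static long rowsPerSecond(int users, int adsPerUser, double seconds) {
        return Math.round(users * adsPerUser / seconds);
    }

    public static void main(String[] args) {
        System.out.println(3_000 * 300);                    // 900000 rows per pass
        System.out.println(rowsPerSecond(3_000, 300, 0.3)); // ~3000000 rows/s
    }
}
```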
users.stream()
     .map(user -> session.executeAsync(boundQuery.bind(user)))
     .map(ResultSetFuture::getUninterruptibly)
     .map(ResultSet::all)
     .flatMap(List::stream)
     .collect(toList());
A remark: on the second map() you're calling ResultSetFuture::getUninterruptibly. It's a blocking call, so you don't benefit much from asynchronous execution.
Instead, try to transform the list of Futures returned by the driver (hint: ResultSetFuture implements Guava's ListenableFuture interface) into a Future of a List.
See: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/util/concurrent/Futures.html#successfulAsList(java.lang.Iterable)
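Guava's Futures.successfulAsList does this for the driver's ListenableFutures; as a stand-in sketch with plain JDK CompletableFutures (my illustration, not the driver's API), the fire-all-then-collect idea looks like:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class FutureOfList {
    // Fire all requests first, then gather results through a single future
    // of a list, so nothing blocks before every request is in flight.
    static <T> CompletableFuture<List<T>> sequence(List<CompletableFuture<T>> futures) {
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(done -> futures.stream()
                        .map(CompletableFuture::join) // safe: allOf already completed them
                        .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        List<CompletableFuture<Integer>> futures = List.of(
                CompletableFuture.supplyAsync(() -> 1),
                CompletableFuture.supplyAsync(() -> 2));
        System.out.println(sequence(futures).join()); // [1, 2]
    }
}
```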

Cassandra datastax driver ResultSet sharing in multiple threads for fast reading

I have huge tables in Cassandra, more than 2 billion rows and increasing. The rows have a date field and follow a date-bucket pattern so as to limit the size of each partition.
Even then, I have more than a million entries for a particular date.
I want to read and process the rows for each day as fast as possible. What I am doing is getting an instance of com.datastax.driver.core.ResultSet, obtaining an iterator from it, and sharing that iterator across multiple threads.
So, essentially I want to increase the read throughput. Is this the correct way? If not, please suggest a better way.
Unfortunately you cannot do this as-is. The reason is that a ResultSet holds internal paging state that is used to retrieve rows one page at a time.
You do have options however. Since I imagine you are doing range queries (queries across multiple partitions), you can use a strategy where you submit multiple queries across token ranges at a time using the token directive. A good example of this is documented in Paging through unordered partitioner results.
java-driver 2.0.10 and 2.1.5 each provide a mechanism for retrieving token ranges from Hosts and splitting them. There is an example of how to do this in the java-driver's integration tests in TokenRangeIntegrationTest.java#should_expose_token_ranges():
PreparedStatement rangeStmt = session.prepare("SELECT i FROM foo WHERE token(i) > ? and token(i) <= ?");
TokenRange foundRange = null;
for (TokenRange range : metadata.getTokenRanges()) {
    List<Row> rows = rangeQuery(rangeStmt, range);
    for (Row row : rows) {
        if (row.getInt("i") == testKey) {
            // We should find our test key exactly once
            assertThat(foundRange)
                .describedAs("found the same key in two ranges: " + foundRange + " and " + range)
                .isNull();
            foundRange = range;
            // That range should be managed by the replica
            assertThat(metadata.getReplicas("test", range)).contains(replica);
        }
    }
}
assertThat(foundRange).isNotNull();
...
private List<Row> rangeQuery(PreparedStatement rangeStmt, TokenRange range) {
    List<Row> rows = Lists.newArrayList();
    for (TokenRange subRange : range.unwrap()) {
        Statement statement = rangeStmt.bind(subRange.getStart(), subRange.getEnd());
        rows.addAll(session.execute(statement).all());
    }
    return rows;
}
You could basically generate your statements and submit them asynchronously; the example above just iterates through the statements one at a time.
Another option is to use the spark-cassandra-connector, which essentially does this under the covers and in a very efficient way. I find it very easy to use and you don't even need to set up a spark cluster to use it. See this document for how to use the Java API.
