First preparedStatement using Cassandra always slow - cassandra

I noticed that if I have a Java method that runs a PreparedStatement using the JDBC driver that comes with Cassandra, it is always slow. But if I put the same query twice in the method, the second execution is 20x faster. Why is that? I would think the second, third, and fourth time I call the Java method it would be faster than the first. I am using Cassandra 1.2.5. I have also cached 100MB of rows in the row cache and set the table to caching = "all". In cassandra-cli I verified the settings. I also verified in cassandra-cli that the second, third, and fourth time I get rows from the same table that the JDBC calls run against, I get a faster response time.
Any ideas?
Thanks,
-Tony

From the all-knowing CQL3 documentation (always a great starting point, by the way):
Prepared statement is an optimization that allows to parse a query only once but execute it multiple times with different concrete values.
The statement gets cached, and that is the difference you are seeing. Prepared statements are also pre-compiled, which typically means the query is parsed and an execution plan is prepared before the query is run against the database, so that work does not have to be repeated on later executions.
On the first run your prepared statement is cached in case you run the same query again, which you do, and since it is cached the query executes much faster.

Related

Could my large amount of tables (2k+) be causing my write timeout exceptions?

I'm running OS Cassandra 3.11.9 with Datastax Java Driver 3.8.0. I have a Cassandra keyspace that has multiple tables functioning as lookup tables / search indices. Whenever I receive a new POST request to my endpoint, I parse the object and insert it into the corresponding Cassandra table. I also insert into each corresponding lookup table (10-20 inserts per object).
When ingesting a lot of data into the system, I've been running into WriteTimeoutExceptions in the driver.
I tried to serialize the insert requests into the lookup tables by introducing Apache Camel and putting all the Statements into a queue that the Session could work off of, but it did not help.
With Camel, since the exceptions are now happening in the Camel thread, the test continues to run, instead of failing on the first exception. Eventually, the test seems to crash Cassandra. (Nothing in the Cassandra logs though)
I also tried to turn off my lookup tables and instead insert into the main table 15x per object (to simulate a similar number of writes as if I had the lookup tables on). This test passed with no exception, which makes me think the large number of tables is the problem.
Is a large number (2k+) of Cassandra tables a code smell? Should we rearchitect or just throw more resources at it? Nothing indicative has shown up in the logs, mostly just some status messages about the number of tables etc. (no exceptions).
Can the Datastax Java Driver be used multithreaded like this? It says it is threadsafe.
A high number of tables has a direct effect on performance: see this doc (the whole series is a good source of information) and this blog post for more details. Basically, with ~1000 tables, you get ~20-25% degradation of performance.
That could be a reason, not completely direct, but related. For each table, Cassandra needs to allocate memory, have a part for it in the memtable, keep information about it, etc. This specific problem could come from blocked memtable flushes, or something like that. Check nodetool tpstats and nodetool tablestats for blocked or pending memtable flushes. It's better to set up a continuous monitoring solution, such as the metrics collector for Apache Cassandra, and watch the important metrics, which include that information as well, over a period of time.

Implications of keeping Cassandra ResultSet open for a while

I'm using the Cassandra Java driver with the fetch size set to 1k. I need to query all records in a table and perform some time-consuming action for every row.
What will happen if I keep the ResultSet open (not fully iterated) for one day?
What I don't care about:
consistency. If a new record is written in the meantime, I'm OK with fetching it. However, I'm also fine if I don't get it
fault tolerance. If some node fails during that process, I'm fine if the query fails too. However, I would like to be able to detect that from the client side.
What I care about:
Cassandra resource utilization - I don't want to cause cluster outage due to some blocked resources
lateness - I don't want to block (or slow down much) cluster for other consumers of that table
I would like to get all records which existed when I started the query (assuming no deletions). However, they don't have to be up to date
The paging state is the information about the last read data (literally the serialized partition key, clustering, and remaining). When it is sent to the coordinator, the coordinator looks for everything greater than that. So no resources are held on the server for an open ResultSet, and there is no performance impact compared with a normal read.
Cassandra does not have any features to allow isolation even within a single query. If data has changed between the first page request and the second, you will get the up-to-date information.
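To make the "everything greater than that" behaviour concrete, here is a toy model in plain Python. It is not the driver API and makes simplifying assumptions: the table is just a sorted list of keys (Cassandra really orders partitions by token), and fetch_page is a hypothetical helper. It only illustrates that each page request is a stateless "continue after this position" read, so rows written ahead of the position are seen and rows written behind it are not.

```python
import bisect

def fetch_page(sorted_keys, paging_state, fetch_size):
    """Return (page, next_paging_state) for keys strictly greater than paging_state.

    Nothing is remembered between calls: the paging state alone says where to resume.
    """
    if paging_state is None:
        start = 0
    else:
        start = bisect.bisect_right(sorted_keys, paging_state)
    page = sorted_keys[start:start + fetch_size]
    # A full page means there may be more data; a short page means we are done.
    next_state = page[-1] if len(page) == fetch_size else None
    return page, next_state

table = [10, 20, 30, 40]
page1, state = fetch_page(table, None, 2)

# Writes that arrive while the "ResultSet" sits half-iterated:
bisect.insort(table, 15)  # behind the current position
bisect.insort(table, 35)  # ahead of the current position

page2, state = fetch_page(table, state, 2)
page3, state = fetch_page(table, state, 2)
```

In this model the late write of 35 shows up in the second page while 15 is never returned, which matches the "no isolation" point above.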

Optimizing a SELECT from Oracle to INSERT into SQLite (in Python3)

I have an Oracle database from which I'm extracting a subset of the data into a local SQLite database. My code is basically the following:
print(datetime.datetime.now())
#Oracle portion of the script
sql = '''select [columns] from [table] where [condition]'''
oracle_cursor.execute(sql)
print(datetime.datetime.now())
#SQLite portion of the script
sqlite_conn.executemany('''INSERT into [table]
([columns])
values (?,?,?...etc.)''',
oracle_cursor.fetchall()
)
sqlite_conn.commit()
sqlite_conn.close()
It does what I need it to, but it takes longer than I would like. The execution of the Oracle portion is actually surprisingly fast at around 3 minutes, but the inserting takes much longer. I've played around with the SQLite settings like buffer settings, etc. Nothing seems to break 50 rows / second. There is a spike in network activity for the first three minutes, but once it prints the second datetime from above, there's no network activity, which leads me to believe the bottleneck is something I've coded. Is my code inefficient at inserting? If so, is there a better way to get what I'm after?
fetchall() loads all data into memory. This is not necessary because executemany can work with any iterator; replace oracle_cursor.fetchall() with oracle_cursor.
Also ensure that you are using a single transaction. (If you have enabled autocommit mode, you should start a transaction explicitly.)
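A minimal sketch of both suggestions, using only the standard sqlite3 module. The table name extract, its two columns, and the fake_oracle_cursor generator are made up for illustration; the generator stands in for the real Oracle cursor, which would be passed to executemany directly in the same way.

```python
import sqlite3

def copy_rows(source_rows, sqlite_conn):
    """Stream rows from any iterable into SQLite inside one transaction."""
    # Using the connection as a context manager commits once on success
    # and rolls back if an exception is raised.
    with sqlite_conn:
        sqlite_conn.executemany(
            "INSERT INTO extract (id, name) VALUES (?, ?)",
            source_rows,  # consumed lazily, so fetchall() is not needed
        )

def fake_oracle_cursor(n):
    """Hypothetical stand-in for the Oracle cursor: yields row tuples one by one."""
    for i in range(n):
        yield (i, f"name-{i}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extract (id INTEGER PRIMARY KEY, name TEXT)")
copy_rows(fake_oracle_cursor(1000), conn)
row_count = conn.execute("SELECT COUNT(*) FROM extract").fetchone()[0]
conn.close()
```

With a real Oracle cursor the call would look like copy_rows(oracle_cursor, sqlite_conn) after oracle_cursor.execute(sql).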

DocumentDB: How to run a query without timing out

I am new to DocumentDB. I wrote a stored procedure that checks all records and updates them under certain circumstances.
Current scenario:
It processes 100 records at a time and updates them, but after running a few times (taking 100 records at a time and updating them) it times out.
Expectation:
Run the script on all the records without timing out.
The document has close to a million records, so running the same script multiple times manually is not the approach I am looking for.
Can anyone please advise me how I can achieve that?
tl;dr: Keep calling the sproc, passing the query continuation token back and forth.
A few thoughts:
No amount of RU capacity for a collection will allow you to process all million documents in one call to the sproc.
Sprocs run in isolation on a single replica. This means that they can be transactional but their use will have lower throughput than a regular query that can use all replicas to satisfy the request, so unless you need it to be in a sproc, I recommend using direct queries for reads that don't need to be transactional with writes. Even then, with a million documents, your queries will max out and you'll have to run the query again with a continuation token.
If you must use a sproc... As you are probably aware since you have done the 100 at a time thing, each query returns a continuation token. You can actually add that to the package that you send back from your sproc when it times out. Then you can pass that back into another call to the same sproc and write your sproc to pick up where you left off. The documentdb-utils library for node.js automatically re-calls the sproc until done as long as you follow this pattern for writing your sprocs. If you are using node.js, you could use that (but it has not yet been upgraded to support partitioned collections) or you could write the equivalent in whatever platform you are using.
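The re-call pattern can be sketched on the client side roughly as follows. This is plain Python with no DocumentDB SDK calls: call_sproc is a hypothetical callable that executes the stored procedure, and the response shape (an "updated" count plus a "continuation" value that is None when finished) is an assumption about how you might write your own sproc, not a documented format. The fake sproc below only simulates stopping after each batch.

```python
def run_until_done(call_sproc):
    """Re-call a stored procedure until it stops returning a continuation token."""
    continuation = None
    total_updated = 0
    while True:
        response = call_sproc(continuation)
        total_updated += response["updated"]
        continuation = response.get("continuation")
        if continuation is None:
            return total_updated

def make_fake_sproc(documents, batch_size=100):
    """Hypothetical stand-in for the real sproc: handles one batch, then 'times out'."""
    def fake_sproc(continuation):
        start = 0 if continuation is None else continuation
        end = min(start + batch_size, len(documents))
        for i in range(start, end):
            documents[i]["processed"] = True
        # Hand back where to resume, or None when everything has been handled.
        token = end if end < len(documents) else None
        return {"updated": end - start, "continuation": token}
    return fake_sproc

docs = [{"id": i, "processed": False} for i in range(1050)]
total = run_until_done(make_fake_sproc(docs))
```

The real sproc would return the query's own continuation token instead of an integer offset; the loop on the client stays the same.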

Cassandra PreparedStatement vs normal insert

I'm using Cassandra for my project and I was facing a timeout issue during writes, the same one described in the post Cassandra cluster with bad insert performance and insert stability (at the moment I'm testing with only one node, the Java Driver, and the latest release of Cassandra). The application has to insert a huge quantity of data per user once per day (during the night). I have a REST controller that accepts files and then processes them in parallel as they arrive to insert values into Cassandra. I have to insert 1 million entries per user, where an entry has up to 8 values (time is not so important, it can even take 10 minutes). Following the answer provided in Cassandra cluster with bad insert performance and insert stability, I decided to add executeAsync(), a Semaphore and PreparedStatement to my application, while previously I was using none of them.
The problem now is that, because I use variable keyspaces (one per user) and need to update lists in the database, I can't initialize my PreparedStatements in the initialization phase. Instead I have to do it at least once per file processed (one file contains 10k+ entries), and a user has to upload up to 100 files per day. For this reason, I'm getting this warning:
Re-preparing already prepared query INSERT INTO c2bdd9f7073dce28ed973238ac85b6e5d6162fce.sensorMonitoringLog (timestamp, sensorId, isLogging) VALUES (?, ?, ?). Please note that preparing the same query more than once is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once.
My question is: is it good practice to use PreparedStatement like this, or is it better to use a normal insert with executeAsync()?
Thank you
If you are facing a timeout issue during writes, it is a good idea to use PreparedStatement but not to use asynchronous inserts. Timeouts are a way to protect Cassandra from being overloaded. With asynchronous inserts you are giving it more work at the same time, and the risk of an OOM grows.
To do things properly with PreparedStatement, you have to create one and only one Session object per keyspace. Then each session must prepare its own statement once.
Moreover, be aware that there is a thread-safety risk with PreparedStatement and asynchronous execution. Preparing a statement must be synchronized. But once again, I advise you not to use executeAsync() in such a case.
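The "one session per keyspace, prepare once, synchronize the prepare" advice can be sketched like this. It is a language-neutral illustration written in Python rather than Java driver code: PreparedStatementCache and session_for_keyspace are hypothetical names, and FakeSession only counts prepare calls so the example runs without a cluster. The only thing assumed about a real session is that it exposes a prepare(query) method.

```python
import threading

class PreparedStatementCache:
    """Prepare each (keyspace, query) pair once and reuse the result."""

    def __init__(self, session_for_keyspace):
        # session_for_keyspace: hypothetical callable that returns the single
        # session object belonging to a keyspace.
        self._session_for_keyspace = session_for_keyspace
        self._statements = {}
        self._lock = threading.Lock()

    def get(self, keyspace, query):
        key = (keyspace, query)
        # The lock makes "check, then prepare" atomic, so two threads handling
        # files for the same user cannot both prepare the same statement.
        with self._lock:
            statement = self._statements.get(key)
            if statement is None:
                statement = self._session_for_keyspace(keyspace).prepare(query)
                self._statements[key] = statement
            return statement

class FakeSession:
    """Stand-in for a driver session; it only counts how often prepare is called."""
    def __init__(self):
        self.prepare_calls = 0

    def prepare(self, query):
        self.prepare_calls += 1
        return ("prepared", query)

sessions = {}

def session_for_keyspace(keyspace):
    if keyspace not in sessions:
        sessions[keyspace] = FakeSession()
    return sessions[keyspace]

cache = PreparedStatementCache(session_for_keyspace)
query = "INSERT INTO sensorMonitoringLog (timestamp, sensorId, isLogging) VALUES (?, ?, ?)"
for _ in range(100):  # e.g. 100 uploaded files for the same user
    cache.get("user_a", query)
cache.get("user_b", query)
```

Because the statement is looked up per keyspace, processing a new file for the same user reuses the existing prepared statement instead of triggering the "Re-preparing already prepared query" warning.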

Resources