PrestoDB v0.125 SELECT only returns subset of Cassandra records

SELECT statements in PrestoDB v0.125 with a Cassandra connector to a DataStax Cassandra cluster only return 200 rows, even when the table contains many more rows than that. Aggregate queries like SELECT COUNT(*) over the same table also return a result of just 200.
(This behaviour is identical when querying with the pyhive connector and with the base Presto CLI.)
The documentation isn't much help, but I am guessing that the issue is pagination and a need to set environment variables (which the documentation doesn't explain):
https://prestodb.io/docs/current/installation/cli.html
Does anyone know how I can remove this limit of 200 rows returned? What specific environment variable setting do I need?

For those who come after - the solution is in the cassandra.properties connector configuration for Presto. The key setting is:
cassandra.limit-for-partition-key-select
This needs to be set higher than the total number of rows in the table you are querying, otherwise SELECT queries will return only a fraction of the stored data (not having located all of the partition keys).
Complete copy of my config file (which may help!):
connector.name=cassandra
# Comma separated list of contact points
cassandra.contact-points=host1,host2
# Port running the native Cassandra protocol
cassandra.native-protocol-port=9042
# Limit of rows to read for finding all partition keys.
cassandra.limit-for-partition-key-select=2000000000
# maximum number of schema cache refresh threads, i.e. maximum number of parallel requests
cassandra.max-schema-refresh-threads=10
# schema cache time to live
cassandra.schema-cache-ttl=1h
# schema refresh interval
cassandra.schema-refresh-interval=2m
# Consistency level used for Cassandra queries (ONE, TWO, QUORUM, ...)
cassandra.consistency-level=ONE
# fetch size used for Cassandra queries
cassandra.fetch-size=5000
# fetch size used for partition key select query
cassandra.fetch-size-for-partition-key-select=20000

Related

What is the impact of LIMIT in Cassandra CQL

When executing a cqlsh query like SELECT * FROM table LIMIT 10, would Cassandra scan the entire table and just return the first 10 records, or can it precisely locate the first 10 records across the whole datacenter without scanning the entire table?
The LIMIT option puts an upper-bound on the maximum number of rows returned by a query but it doesn't prevent the query from performing a full table scan.
Cassandra has internal mechanisms, such as request timeouts, which prevent bad queries from causing the cluster to crash, so queries are more likely to time out than to overload the cluster with scans on all nodes/replicas.
As a side note, the LIMIT option is irrelevant when used with SELECT COUNT(*), since the count function returns just 1 row (by design). COUNT(*) needs to do a full table scan regardless of the limit set. I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/6897/. Cheers!

Cassandra query table without partition key

I am trying to extract data from a table as part of a migration job.
The schema is as follows:
CREATE TABLE IF NOT EXISTS ${keyspace}.entries (
username text,
entry_type int,
entry_id text,
PRIMARY KEY ((username, entry_type), entry_id)
);
In order to query the table we need the partition keys, the first part of the primary key.
Hence, if we know the username and the entry_type, we can query the table.
In this case the username can be whatever, but the entry_type is an integer in the range 0-9.
When doing the extraction we iterate the table 10 times for every username to make sure we try all values of entry_type.
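For illustration, that extraction loop might look roughly like this with the DataStax Java driver 3.x (the contact point, keyspace name, and handleEntry() helper are placeholders, not part of the original job):
Cluster cluster = Cluster.builder().addContactPoint("host1").build();
Session session = cluster.connect();
PreparedStatement ps = session.prepare(
    "SELECT entry_id FROM my_keyspace.entries WHERE username = ? AND entry_type = ?");
for (String username : usernames) {                        // usernames: the known list
    for (int entryType = 0; entryType <= 9; entryType++) { // all 10 entry_type values
        for (Row row : session.execute(ps.bind(username, entryType))) {
            handleEntry(row.getString("entry_id"));        // placeholder handler
        }
    }
}
cluster.close();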
We can no longer find any entries, as we have depleted our list of usernames. But nodetool tablestats reports that there is still data left in the table, gigabytes even. Hence we assume the table is not empty.
But I cannot find a way to inspect the table to figure out what usernames remains in the table. If I could inspect it I could add the usernames left in the table to our extraction job and eventually we could deplete the table. But I cannot simply query the table as such:
SELECT * FROM ${keyspace}.entries LIMIT 1
as Cassandra requires the partition keys to make meaningful queries.
What can I do to figure out what is left in our table?
As per the comment, the migration process includes a DELETE operation from the Cassandra table, but the engine will have a delay before actually removing the affected records from disk; this process is controlled internally with tombstones and the gc_grace_seconds attribute of the table. The reason for this delay is fully explained in this blog entry; the tl;dr is that if the default value is still in place, at least 10 days (864,000 seconds) must pass from the execution of the delete before the actual removal of the data.
For your case, one way to proceed is:
Ensure that all your nodes are "Up" and "Healthy" (UN)
Decrease the gc_grace_seconds attribute of your table; the statement below sets it to 1 minute, while the default is 10 days (864,000 seconds):
ALTER TABLE ${keyspace}.entries WITH gc_grace_seconds = 60;
Manually compact the table:
nodetool compact ${keyspace} entries
Once the process is completed, nodetool tablestats should be up to date.
To answer your first question, I would like to shed more light on the gc_grace_seconds property.
In Cassandra, data isn't deleted in the same way it is in RDBMSs. Cassandra is designed for high write throughput and avoids reads-before-writes. So in Cassandra, a delete is actually an update, and updates are actually inserts. A "tombstone" marker is written to indicate that the data is now (logically) deleted (also known as a soft delete). Records marked with a tombstone must be removed to reclaim the storage space, which is done by a process called compaction. But remember that tombstones are eligible for physical deletion / garbage collection only after a specific number of seconds known as gc_grace_seconds. This is a very good blog to read about it in more detail: https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
Possibly you are looking at the table size before gc_grace_seconds has elapsed, so the data is still there.
Coming to your second issue, where you want to fetch some samples from the table without providing partition keys: you can analyze your table content using Spark. The Spark Cassandra Connector allows you to create Java applications that use Spark to analyze database data. You can follow the articles / documentation to write a quick Spark application to analyze the Cassandra data (a minimal sketch follows the links below).
https://www.instaclustr.com/support/documentation/cassandra-add-ons/apache-spark/using-spark-to-sample-data-from-one-cassandra-cluster-and-write-to-another/
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkJavaApi.html
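As a rough example, a minimal Spark job using the connector's Java API could list the distinct usernames still present in the table (the contact point and keyspace name here are assumptions):
SparkConf conf = new SparkConf()
    .setAppName("SampleEntries")
    .set("spark.cassandra.connection.host", "host1");  // assumed contact point
JavaSparkContext sc = new JavaSparkContext(conf);
CassandraJavaUtil.javaFunctions(sc)
    .cassandraTable("my_keyspace", "entries")          // assumed keyspace name
    .select("username")                                // read only the partition-key column
    .map(row -> row.getString("username"))
    .distinct()
    .take(100)                                         // a sample of remaining usernames
    .forEach(System.out::println);
sc.stop();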
I would recommend not deleting records while you do the migration. Rather, first complete the migration, and after that do a quick validation / verification to ensure all records were migrated successfully (this you can easily do using Spark by comparing dataframes from the old and new tables). After successful verification, truncate the old table, as TRUNCATE does not create tombstones and is hence more efficient. Note that a huge number of tombstones is not good for cluster health.
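Reusing the JavaSparkContext from the sketch above, a rough count comparison could look like this (the migrated table name is hypothetical):
long oldCount = CassandraJavaUtil.javaFunctions(sc)
    .cassandraTable("my_keyspace", "entries").count();
long newCount = CassandraJavaUtil.javaFunctions(sc)
    .cassandraTable("my_keyspace", "entries_migrated").count();  // hypothetical target table
if (oldCount == newCount) {
    System.out.println("Row counts match; safe to TRUNCATE the old table.");
}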

How to determine cassandra partition for a given PK on the client?

I'm trying Cassandra to replace MySQL for a large dataset I have (2.5 TB / 5 billion rows) that I can't scale further on a single server.
I insert/update a few million rows every hour. Currently, I'm inserting and querying one by one in Cassandra because I don't know which partition has the data, and grouping queries seems to be slower. But one by one, I can't match the speed of a single MySQL server even with 3 Cassandra nodes.
In MySQL, I can batch because I know everything is stored on the same server. Is it possible, using the value of the primary key, to determine the partition on the client side, so I can group the queries more effectively with BATCH or SELECT .. IN?
I mean, given a group of PKs like 1, 2, 3, 4, 5, 6 ... and N servers, I'd like to know that, say, rows 1, 3, 5 are in the same partition, so I can group them in my queries. Is this possible with Cassandra?
If you're performing queries with WHERE on the partition key, then most of the time the drivers take care of routing requests to the replicas that have the data (provided you didn't change the load balancing policy; by default all drivers use a so-called token-aware policy) by calculating the token for the given partition key and finding the replica(s) for it.
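If you want to see that grouping explicitly, the Java driver 3.x exposes it through the cluster metadata; a hedged sketch (the contact point, keyspace name, and bigint key type are assumptions):
Cluster cluster = Cluster.builder().addContactPoint("host1").build();
Metadata metadata = cluster.getMetadata();
// Serialize the partition key value the same way the driver would.
ByteBuffer pk = TypeCodec.bigint().serialize(5L, ProtocolVersion.NEWEST_SUPPORTED);
Set<Host> replicas = metadata.getReplicas("my_keyspace", pk);
System.out.println("PK 5 is stored on: " + replicas);
Keys whose replica sets come out equal live on the same nodes and could be grouped together.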
If you need to fetch multiple entries, then running N queries in parallel via the async API and merging the results on the client side will be more effective than performing a query with IN.
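A minimal sketch of that pattern (session is an existing Session; the table and column names are assumptions):
List<ResultSetFuture> futures = new ArrayList<>();
for (long id : ids) {  // ids: the PKs you would otherwise put in the IN clause
    futures.add(session.executeAsync(
        "SELECT * FROM my_keyspace.t WHERE pk = ?", id));
}
for (ResultSetFuture f : futures) {
    Row row = f.getUninterruptibly().one();  // merge results client-side
    // ... process row
}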
P.S. In Cassandra, BATCH has slightly different semantics than in relational databases. Please check this documentation for recommended patterns.

How to efficiently read only the row keys in Cassandra...

Accessing all rows from all nodes in Cassandra would be inefficient. Is there a way to access Index.db, which already holds the row keys? Is something of this sort supported out of the box in Cassandra?
There is no way to get all keys with one request without reaching every node in the cluster. There is however paging built-in in most Cassandra drivers. For example in the Java driver: https://docs.datastax.com/en/developer/java-driver/3.3/manual/paging/
This will put less stress on each node, as it only fetches a limited amount of data with each request. Each subsequent request will continue from the last, meaning you will touch every result for the request you're making.
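For instance, with the Java driver 3.x, paging is transparent once a fetch size is set (session is an existing Session; the table name is an assumption):
Statement stmt = new SimpleStatement("SELECT key FROM my_keyspace.t")
    .setFetchSize(1000);             // rows per page, not a LIMIT
ResultSet rs = session.execute(stmt);
for (Row row : rs) {
    // The driver fetches the next page behind the scenes as needed.
    System.out.println(row.getObject("key"));
}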
Edit: This is probably what you want: How can I get the primary keys of all records in Cassandra?
One possible option could be querying all the token ranges.
For example,
SELECT DISTINCT <partn_col_name> FROM <table_name> WHERE token(<partn_col_name>) >= <from_token_range> AND token(<partn_col_name>) < <to_token_range>
With the above query, you can get all the partition keys available within a given token range. Adjust the token ranges depending on execution time.
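With the Java driver 3.x you can take the ranges straight from the cluster metadata instead of computing them yourself. A hedged sketch, keeping the placeholder names from the query above (note the driver's TokenRange is exclusive at the start and inclusive at the end, hence > and <=):
PreparedStatement ps = session.prepare(
    "SELECT DISTINCT partn_col_name FROM table_name"
    + " WHERE token(partn_col_name) > ? AND token(partn_col_name) <= ?");
for (TokenRange range : cluster.getMetadata().getTokenRanges()) {
    for (TokenRange sub : range.unwrap()) {  // split ranges that wrap around the ring
        BoundStatement bs = ps.bind()
            .setToken(0, sub.getStart())
            .setToken(1, sub.getEnd());
        for (Row row : session.execute(bs)) {
            System.out.println(row.getObject("partn_col_name"));
        }
    }
}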

Select All Performance in Cassandra

I'm currently using DB2 and planning to use Cassandra because, as I understand it, Cassandra has read performance greater than an RDBMS.
Maybe this is a stupid question, but I ran an experiment comparing read performance between DB2 and Cassandra.
Testing with 5 million records and the same table schema.
With the query SELECT * FROM customer, DB2 takes 25-30 s and Cassandra takes 40-50 s.
But with a WHERE condition, SELECT * FROM customer WHERE cusId IN (100,200,300,400,500), DB2 takes 2-3 s and Cassandra takes 3-5 ms.
Why is Cassandra faster than DB2 with a WHERE condition? So I can't prove which database is better with SELECT * FROM customer, right?
FYI.
Cassandra: RF=3 and CL=1 with 3 nodes, each node running on a separate computer (Ubuntu VMs)
DB2: running on Windows
Table schema:
cusId int PRIMARY KEY, cusName varchar
If you look at the types of problems that Cassandra is good at solving, the reasons why unbound ("select all") queries suck become quite apparent.
Cassandra was designed to be a distributed database. In many Cassandra storage patterns, the number of nodes is greater than the replication factor (i.e., not all nodes contain all of the data). Therefore, limiting the number of network hops becomes essential to modeling high-performing queries. Cassandra performs very well with specific queries (which utilize the partition/clustering key structure), because it can quickly locate the node primarily responsible for the data.
Unbound queries (a.k.a. multi-key queries) incur extra network time because a coordinator node is required. So one node acts as the coordinator, queries all other nodes, collates the data, and returns the result set. Specifying a WHERE clause (with at least a partition key) while using a "token aware" load balancing policy performs well for two reasons:
A coordinator node is not required.
The node primarily responsible for the range is queried, returning the result set in a single network hop.
tl;dr;
Querying Cassandra with an unbound query causes it to incur a lot of extra processing and network time that it normally wouldn't have to, had the query been specified with a WHERE clause.
Even for a troublesome query like a no-condition range query, 40-50 s is pretty extreme for C*. Is the coordinator hitting GCs during coordination? Can you include the code used for your test?
When you run a SELECT * against millions of records, it won't fetch them all at once; it will grab fetchSize rows at a time. If you're just iterating through this, the iterator will actually block even if you used executeAsync initially. This means that every 10k (default) records it will issue a new query that you will block on. The serialized nature of this will take time just from a network perspective. http://docs.datastax.com/en/developer/java-driver/3.1/manual/async/#async-paging explains how to do it in a non-blocking way. You can use this to kick off the next page fetch while processing the current one, which would help.
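A sketch of that prefetching pattern (session is an existing Session; the fetch size, threshold, and process() are placeholders):
Statement stmt = new SimpleStatement("SELECT * FROM my_keyspace.customer")
    .setFetchSize(10000);
ResultSet rs = session.execute(stmt);
for (Row row : rs) {
    // Kick off the next page while the current one is still being processed.
    if (rs.getAvailableWithoutFetching() == 5000 && !rs.isFullyFetched()) {
        rs.fetchMoreResults();  // asynchronous, so the iterator won't block later
    }
    process(row);               // placeholder
}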
Decreasing the limit or fetch size could also help, since the coordinator may walk token ranges (parallelism is possible here, but its heuristic is not perfect) one at a time until it has read enough. If it has to walk too many nodes to respond, it will be slow; this is why even empty tables can be very slow to do a SELECT * on, as it may serially walk every replica set. With 256 vnodes this can be very bad.
