I am trying to get the number of rows in a table, but Cassandra returns a timeout for this query: select count(*) from events;
I think my table is too big, so even when I give a timeout value for my query (cqlsh --request-timeout=200000), it always times out.
The table size is 1.3 TB. Is there any way to find out how many rows are in this table?
Do not use count(*) to get the number of rows. Instead, you can use the following link and download the jar file to get the count:
https://github.com/brianmhess/cassandra-count
One solution that can help you find the total number of rows is pagination of the result.
Please refer to the doc below:
https://docs.datastax.com/en/developer/java-driver/3.2/manual/paging/
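For illustration, here is a minimal, hedged sketch with the DataStax Java driver 3.x (the contact point, keyspace, table, and column names are placeholders, not from the question). The driver fetches pages transparently while you iterate, so each request stays small instead of one huge scan:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PagedCount {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {
            // 1000 rows per page; the driver requests the next page
            // transparently as the iterator advances.
            Statement stmt = new SimpleStatement("SELECT key FROM events").setFetchSize(1000);
            ResultSet rs = session.execute(stmt);
            long count = 0;
            for (Row ignored : rs) {
                count++;
            }
            System.out.println("total rows: " + count);
        }
    }
}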
Note: You can also try ALLOW FILTERING for development! But its use should be avoided, as it is a very expensive query that might impact the performance of Cassandra.
Related: Is there any best way that we can get the total number of rows from the Cassandra table?
Regards,
Mani
The DataStax Bulk Loader (DSBulk) is probably the easiest to install and run.
The Apache Spark Cassandra connector could be handy. Once the RDD is loaded with sc.cassandraTable() you can count it (see the sketch after these notes).
Avoid counting in your application code: it does not scale, since it performs a full scan of the cluster, and the response time will be in seconds.
Avoid counting with CQL select count(*) as you will likely hit the timeout quickly.
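As a rough sketch of the Spark approach (keyspace and table names are made up, and this assumes the spark-cassandra-connector Java API is on the classpath), the scan is split across executors by token range instead of going through a single coordinator:
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkRowCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("row-count")
                .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Load the table as an RDD and count it; each executor scans
        // only its share of the token ranges.
        long rows = javaFunctions(sc).cassandraTable("my_keyspace", "events").count();
        System.out.println("row count: " + rows);
        sc.stop();
    }
}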
You can simply use COUNT(*) to get the number of rows in the table.
For example,
Syntax:
SELECT Count(*)
FROM tablename;
and the expected output looks like this,
count
-------
4
(1 rows)
Background
Cassandra has a built-in CQL function COUNT() which counts the number of rows returned by a query. If you execute an unbounded query (no filter or WHERE clause), it will retrieve all the partitions in the table and count them, for example:
SELECT COUNT(*) FROM table_name;
Pitfalls
However, this is NOT recommended since it requires a full table scan that would query every single node which is very expensive and will affect the performance of the cluster.
It might work for very small clusters (for example, 1 to 3 nodes) with very small datasets (for example, a few thousand partitions) but in practice it would likely timeout and not return results. I've explained in detail why you shouldn't do this in Why COUNT() is bad in Cassandra.
Recommended solution
There are different techniques for counting records in the database but the easiest way is to use the DataStax Bulk Loader (DSBulk). It is open-source so it's free to use. It was originally designed for bulk-loading data to and exporting data from a Cassandra cluster as a scalable solution for the cqlsh COPY command.
DSBulk has a count command that provides the same functionality as the CQL COUNT() function, but with optimisations that break the table scan into small token range queries, so it doesn't suffer from the same problems as brute-force counting.
DSBulk is quite simple to use and only takes a few minutes to set up. First, you need to download the binaries from DataStax Downloads, then unpack the tarball. For details, see the DSBulk Installation Instructions.
Once you've got it installed, you can count the partitions in a table with one command:
$ cd path/to/dsbulk_installation
$ bin/dsbulk count -h <node_ip> -k ks_name -t table_name
Here are some references with examples to help you get started quickly:
Docs - Counting data in tables
Blog - Counting records with DSBulk
Blog - DSBulk Intro + Loading data
You can also use cqlsh as an alternative for small tables.
Refer to this documentation:
https://www.datastax.com/blog/running-count-expensive-cassandra
When executing a cqlsh query like select * from table limit 10, will Cassandra scan the entire table and just return the first 10 records, or can it precisely locate the first 10 records across the whole datacenter without scanning the entire table?
The LIMIT option puts an upper-bound on the maximum number of rows returned by a query but it doesn't prevent the query from performing a full table scan.
Cassandra has internal mechanisms such as request timeouts which prevent bad queries from causing the cluster to crash so queries are more likely to timeout rather than overloading the cluster with scans on all nodes/replicas.
As a side note, the LIMIT option is irrelevant when used with SELECT COUNT() since the count function returns just 1 row (by design). COUNT() needs to do a full table scan regardless of the limit set. I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/6897/. Cheers!
I am trying to select all records from one Cassandra table (~10M records), which should be distributed over 4 nodes, using CQL shell, but every time I do that it pages the output at 1K records max. So my question is: is it possible to select all records at once, as I am trying to see how much time it takes Cassandra to retrieve all records?
When you write SELECT * FROM cf, the CQL client will never select everything at once; that would be an unreasonable action for large data. Instead, it will load only the first page and give you an iterator. Cassandra from version 2.0 supports automatic query paging, so you should run your select-all query and iterate over the pages to load the full column family. See the example for the Python client. There is no way to load everything in one action in CQL now, and there shouldn't be.
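To make the iterator behaviour concrete, here is a hedged sketch with the DataStax Java driver 3.x (keyspace and table names are assumptions). It consumes one page per request and saves the PagingState so the next request resumes exactly where the previous page ended:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PagingState;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PageByPage {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {
            PagingState pagingState = null;
            while (true) {
                Statement stmt = new SimpleStatement("SELECT * FROM cf").setFetchSize(1000);
                if (pagingState != null) {
                    stmt.setPagingState(pagingState); // resume after the last page
                }
                ResultSet rs = session.execute(stmt);
                // Consume exactly one page without triggering a background fetch.
                int inPage = rs.getAvailableWithoutFetching();
                for (int i = 0; i < inPage; i++) {
                    Row row = rs.one();
                    // process row ...
                }
                pagingState = rs.getExecutionInfo().getPagingState();
                if (pagingState == null) {
                    break; // no more pages
                }
            }
        }
    }
}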
While it was already pointed out that it's a bad idea to try to load all the data in cqlsh, what you're trying to do is still somewhat possible. You just need to set a limit and probably increase the timeout for cqlsh.
user#host:~# cqlsh --request-timeout=600
This will start the shell with a request timeout of 10 minutes.
select * from some_table limit 10000000;
Please do not use this in a production environment, as it might have terrible implications for performance and cluster availability!
I want to know how to configure Cassandra to get better READ performance, because when I run a SELECT query on a table which has 1M rows I get a TimedOutException.
I've already changed request_timeout_in_ms and added more nodes, but I still get the same error.
You are querying too many rows at once. You need to query fewer rows at a time and page through them.
Update:
First query:
select <KEY>,p0001 from eExtension limit 1000;
Then take the last key returned by that query and run:
select <KEY>,p0001 from eExtension where token(<KEY>) > token(<LAST KEY RETURNED FROM PREVIOUS>) limit 1000;
Repeat that pattern until done.
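A minimal sketch of that loop with the DataStax Java driver 3.x (the keyspace name and the partition key column name key are assumptions; substitute your own schema):
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TokenPager {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {
            String lastKey = null;
            while (true) {
                ResultSet rs = (lastKey == null)
                        ? session.execute("SELECT key, p0001 FROM eExtension LIMIT 1000")
                        : session.execute("SELECT key, p0001 FROM eExtension"
                                + " WHERE token(key) > token(?) LIMIT 1000", lastKey);
                int rowsInPage = 0;
                for (Row row : rs) {
                    lastKey = row.getString("key"); // remember the last key seen
                    rowsInPage++;
                    // process row ...
                }
                if (rowsInPage < 1000) {
                    break; // a short page means we reached the end of the ring
                }
            }
        }
    }
}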
Sounds like you're trying to read all 1M rows at once. Don't.
One way to do pagination is to use a Cassandra client like PlayOrm. PlayOrm returns a cursor when you query, and as your first page reads in the first 100 results and displays them, the next page can just use the same cursor in your session, and it picks up right where it left off without rescanning the first 100 rows again.
Visit this to see the example for a cursor, and this for all the features and more details about PlayOrm.
I'm writing a file with 5M user profiles into Cassandra.
My write operation finished successfully.
I want to count the number of rows in my column family.
StringSerializer se = StringSerializer.get(); // serializer for string keys and column names
Keyspace keyspaceOperator = HFactory.createKeyspace(KEY_SPACE, cluster);
CqlQuery<String,String,Long> cqlQuery = new CqlQuery<String,String,Long>(keyspaceOperator, se, se, new LongSerializer());
cqlQuery.setQuery("SELECT COUNT(*) FROM up");
QueryResult<CqlRows<String,String,Long>> result = cqlQuery.execute();
System.out.println(result.get().getAsCount());
But this code always prints 10000.
What am I doing wrong? And how can I do this operation from the CLI?
You can't for now. There's a default limit of 10K rows per query. There's an open ticket for this (CASSANDRA-3702) but no fix as of yet.
The only other alternative is to iterate via RangeSlicesQuery. I created a "census" program to count both rows and total columns; here's a version for long types. But if this is a frequent activity, conventional wisdom seems to be to use a separate counter column to keep track; there's some discussion here.
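For reference, a hedged sketch of that RangeSlicesQuery iteration with Hector, counting rows one page at a time (string keys and the column family up from the question are assumed):
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.OrderedRows;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.QueryResult;
import me.prettyprint.hector.api.query.RangeSlicesQuery;

public class RowCensus {
    public static long countRows(Keyspace keyspace) {
        StringSerializer se = StringSerializer.get();
        long count = 0;
        String start = "";
        while (true) {
            RangeSlicesQuery<String, String, String> query =
                    HFactory.createRangeSlicesQuery(keyspace, se, se, se)
                            .setColumnFamily("up")
                            .setKeys(start, "")   // from 'start' to the end of the ring
                            .setReturnKeysOnly()
                            .setRowCount(1001);   // page size + 1 for the overlap
            QueryResult<OrderedRows<String, String, String>> result = query.execute();
            OrderedRows<String, String, String> rows = result.get();
            count += rows.getCount();
            if (rows.getCount() < 1001) {
                return count;
            }
            // The start key is inclusive, so the last key of this page will be
            // re-fetched as the first key of the next page; don't count it twice.
            count -= 1;
            start = rows.peekLast().getKey();
        }
    }
}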
You simply need to give a limit that's as large as you want to count. If you don't expect the count ever to go over 1e9, then do
SELECT COUNT(*) FROM up LIMIT 1000000000;
But be aware that COUNT (and RangeSlicesQuery too) are not at all performant, or even meant to be. They're essentially the same as a "sequential scan" in relational db parlance. A counter is a better way to address this sort of problem in a distributed system.
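To make the counter suggestion concrete, here is a hedged sketch (the row_counts table and its columns are invented for illustration, and the modern DataStax Java driver is used for brevity): bump a counter on every write, then read a single partition instead of scanning the table.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CounterCount {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {
            // One-time schema (hypothetical):
            //   CREATE TABLE row_counts (name text PRIMARY KEY, n counter);
            // On every profile write, also bump the counter:
            session.execute("UPDATE row_counts SET n = n + 1 WHERE name = 'up'");
            // Getting the count is now a single-partition read:
            long n = session.execute("SELECT n FROM row_counts WHERE name = 'up'")
                            .one().getLong("n");
            System.out.println("rows: " + n);
        }
    }
}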
For a ready-made example that does this, please refer here.
You can freely use the code. Please note that Astyanax was branched out of Hector, and we are finding that it is a very good Cassandra client for Java.