How many IN list values are supported by Presto?

It seems that Oracle has a limit of 1000 values in an IN list, while MySQL is not limited.
How about Presto?

Related

What is the best way to get row count from a table in Cassandra?

Is there a good way to get the total number of rows from a Cassandra table?
Regards,
Mani
DataStax Bulk Loader (DSBulk) is probably the easiest to install and run.
The Apache Spark Cassandra connector could also be handy: once a DataFrame is loaded with sc.cassandraTable(), you can call count() on it.
Avoid counting in your own application code: it does not scale, as it performs a full scan of the cluster, and the response time will be in seconds.
Avoid counting with CQL's SELECT COUNT(*), as you will likely hit the query timeout quickly.
You can simply use COUNT(*) to get the number of rows in the table.
For example,
Syntax:
SELECT Count(*)
FROM tablename;
and the expected output looks like this,
count
-------
4
(1 rows)
Background
Cassandra has a built-in CQL function COUNT() which counts the number of rows returned by a query. If you execute an unbounded query (one with no filter or WHERE clause), it retrieves all the partitions in the table, which you can then count. For example:
SELECT COUNT(*) FROM table_name;
Pitfalls
However, this is NOT recommended, since it requires a full table scan that queries every single node, which is very expensive and affects the performance of the cluster.
It might work for very small clusters (for example, 1 to 3 nodes) with very small datasets (for example, a few thousand partitions), but in practice it would likely time out and not return results. I've explained in detail why you shouldn't do this in Why COUNT() is bad in Cassandra.
Recommended solution
There are different techniques for counting records in the database, but the easiest is to use the DataStax Bulk Loader (DSBulk). It is open-source, so it's free to use. It was originally designed for bulk-loading data into, and exporting data from, a Cassandra cluster as a scalable alternative to the cqlsh COPY command.
DSBulk has a count command that provides the same functionality as the CQL COUNT() function, but with optimisations that break the table scan up into small range queries, so it doesn't suffer from the same problems as brute-force counting.
DSBulk is quite simple to use and only takes a few minutes to set up. First, download the binaries from DataStax Downloads, then unpack the tarball. For details, see the DSBulk Installation Instructions.
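The range-splitting idea behind that optimisation can be sketched in a few lines of Python. This is only the shape of the algorithm, not DSBulk's actual code: the count_range callback is a hypothetical stand-in for a driver call, and only the token arithmetic is real.

```python
# Sketch of DSBulk-style counting: split the Murmur3 token ring into
# subranges and sum small bounded counts instead of running one full scan.
MIN_TOKEN = -(2 ** 63)      # Murmur3Partitioner token range
MAX_TOKEN = 2 ** 63 - 1

def split_token_ring(splits):
    """Divide the full token ring into `splits` contiguous subranges."""
    width = (MAX_TOKEN - MIN_TOKEN) // splits
    bounds = [MIN_TOKEN + i * width for i in range(splits)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))

def count_table(count_range, splits=8):
    # Each subrange maps to a small bounded query of the form:
    #   SELECT COUNT(*) FROM ks.t
    #   WHERE token(pk) > :lo AND token(pk) <= :hi
    # so no single query has to scan the whole table.
    return sum(count_range(lo, hi) for lo, hi in split_token_ring(splits))
```

With a real driver, count_range would execute the bounded CQL query shown in the comment; each small query stays well under the timeout that kills an unbounded COUNT(*).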
Once you've got it installed, you can count the partitions in a table with one command:
$ cd path/to/dsbulk_installation
$ bin/dsbulk count -h <node_ip> -k ks_name -t table_name
Here are some references with examples to help you get started quickly:
Docs - Counting data in tables
Blog - Counting records with DSBulk
Blog - DSBulk Intro + Loading data
You can also use cqlsh as an alternative for small tables.
Refer to this documentation:
https://www.datastax.com/blog/running-count-expensive-cassandra

spark-solr: how to increase data reading speed?

I am using spark-solr to fetch 2 or 3 attributes (id and date attributes) from Solr, but it takes tens of seconds to fetch hundreds of thousands of documents.
My Solr collections have around 10 shards, each with 4 replicas, and they contain from tens of millions to hundreds of millions of documents.
For the Lucidworks spark-solr connector, I set rows to 10000 and splits to true.
Is this the expected behavior? (I mean, is Solr inherently slow when fetching data?) Or could you help me understand how to configure Solr and the Lucidworks connector to increase the fetch speed? I have hardly found any answers on the internet.
Thank you for your help :)

Is there a way to check row counts in the Cassandra system tables? Where can we check the metadata of the latest inserts?

I am working on an Oracle-to-Cassandra migration tool, where I want to maintain a validation table with an Oracle count column and a Cassandra count column, so that I can validate the migration job. In Cassandra, does the system maintain a count of recently executed/inserted queries, or the total row count of a particular table? Is this stored anywhere in the Cassandra system tables, and if so, where? If not, please suggest a way to design a validation framework for the data migration.
In other words: is there a way in Cassandra to get the record count of the latest inserts and the total count of a table from any system table, instead of executing a COUNT(*) query against the tables? Does Cassandra maintain these counts anywhere internally? If so, in which system tables can we check the metadata of the latest inserts?
Cassandra is a distributed system, and there is no place where it collects counts per table. You can get some estimates from system.size_estimates, but it only reports the partition count per token range, and the range sizes.
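To see what such an estimate looks like, the per-range figures can simply be summed. A minimal Python sketch, with hard-coded stand-in rows in place of a real query result (the keyspace/table names are made up):

```python
# Rough partition-count estimate from system.size_estimates rows.
# The rows below are hard-coded stand-ins for the output of:
#   SELECT range_start, range_end, partitions_count
#   FROM system.size_estimates
#   WHERE keyspace_name = 'ks' AND table_name = 't';
sample_rows = [
    {"range_start": "-9223372036854775808", "partitions_count": 1200},
    {"range_start": "-3074457345618258603", "partitions_count": 1350},
    {"range_start": "3074457345618258602", "partitions_count": 1100},
]

def estimate_partitions(rows):
    # This is per token range and approximate; it is NOT an exact row
    # count, since one partition may hold many rows.
    return sum(r["partitions_count"] for r in rows)

print(estimate_partitions(sample_rows))  # prints 3650
```

Keep in mind the estimate counts partitions, not rows, so it is only useful as a sanity check, not for exact validation.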
For the framework you're asking about, you may need to develop custom Spark code (the easiest way) that performs the row counting and the other checks. Spark is highly optimized for efficient data access and could be preferable to writing that logic from scratch.
Also, during migration, consider using a consistency level greater than ONE to make sure that at least several nodes have confirmed writing the data. That said, it depends on the amount of data and the timing requirements of your migration jobs.
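The validation-table idea reduces to a per-table comparison of two count sources. A minimal Python sketch; both dictionaries are hypothetical stand-ins, and in a real job they would be filled by an Oracle query and a Spark or DSBulk count respectively:

```python
def validate_migration(oracle_counts, cassandra_counts):
    """Compare per-table row counts and flag mismatches.

    Both arguments map table name -> row count; the sources of those
    counts (Oracle query, Spark/DSBulk count) are outside this sketch.
    """
    report = {}
    for table, expected in oracle_counts.items():
        actual = cassandra_counts.get(table, 0)
        report[table] = {
            "oracle": expected,
            "cassandra": actual,
            "ok": expected == actual,
        }
    return report

result = validate_migration(
    {"orders": 1000, "users": 50},
    {"orders": 1000, "users": 49},
)
print(result["users"]["ok"])  # prints False
```

The report rows map directly onto the validation table described in the question: one row per migrated table, with both counts and a pass/fail flag.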

Efficient numeric storage in Cassandra

I'm storing many small numbers in a Cassandra table with 7.5 billion rows. Many of the numbers can be represented as a tinyint (1 byte), but Cassandra doesn't seem to support any numeric data types smaller than 4 bytes: https://docs.datastax.com/en/cql/3.0/cql/cql_reference/cql_data_types_c.html
My table is about 4 TB and I'm looking to cut down its size. Is varint my answer ("arbitrary-precision integer")? How is varint represented in memory, and what is its smallest size?
Or alternatively, is there a preferred compression configuration that can help this specific case?
You are looking at an old version of the documentation. Since Cassandra 2.2, smallint and tinyint are supported; see the CQL data types documentation for your Cassandra version.
If you are worried about your disk usage, I would recommend using Cassandra 3.x.
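On the varint part of the question: as far as I know, Cassandra serializes varint as a minimal two's-complement byte string (like Java's BigInteger.toByteArray), so its smallest size is 1 byte. A small Python sketch of that size calculation, under that assumption:

```python
def varint_size(n):
    """Bytes in a minimal two's-complement encoding of n, which is
    (as I understand it) how Cassandra serializes varint values."""
    if n >= 0:
        # One extra bit is needed for the sign, hence the trailing +1.
        return n.bit_length() // 8 + 1
    return (-n - 1).bit_length() // 8 + 1

# Small values fit in 1-2 bytes, versus a fixed 4 bytes for int:
print(varint_size(100))        # prints 1
print(varint_size(1000))       # prints 2
print(varint_size(2**31 - 1))  # prints 4
```

So varint would shrink small values, but tinyint and smallint (available since 2.2) give the same savings with a fixed-width type, and per-cell storage overhead can dominate the value size either way.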

Why is MemSQL slower than SQL Server for SELECT statements with substring operations on binary columns?

I have a table with two binary columns, used to store strings that are at most 64 bytes long, and two integer columns. The table has 10+ million rows and uses 2 GB of memory out of the 7 GB available, so there is plenty of memory left. I also configured MemSQL based on http://docs.memsql.com/latest/setup/best_practices/.
For simple SELECT statements where the binary columns are compared to fixed values, MemSQL is about 3 times faster than SQL Server, so we can rule out configuration or hardware issues with MemSQL.
For complex SQL statements that use substring operations in the SELECT clause, and substring and length operations in the WHERE clause, MemSQL is about 10 times slower than SQL Server. The performance of these statements on MemSQL was measured after the first few runs, to make sure that SQL compilation time was not included. It looks like MemSQL's performance issue has to do with how it handles binary columns and substring and string-length operations.
Has anyone seen similar performance issues with MemSQL? If so, what were the column types and SQL operations?
Has anyone seen similar performance issues with MemSQL for substring and length operations on varchar columns?
Thanks.
Michael,
My recommendation: go back to a rowstore table, but with VARBINARY instead of BINARY; consider putting indexes on the columns or creating persisted computed columns, and try rewriting your predicate with LIKE.
If you paste an example query, I can help you transform it.
The relevant docs are here
dev.mysql.com/doc/refman/5.0/en/pattern-matching.html
docs.memsql.com/4.0/concepts/computed_columns
Good luck.
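The LIKE rewrite suggested above can be illustrated with a stand-in database (SQLite via Python's stdlib here, not MemSQL, and the table and column names are made up). The point is that the two predicates return the same rows, while the LIKE prefix form is the one an engine can typically match against an index instead of evaluating a function per row:

```python
import sqlite3

# Demo: a substring-equality predicate rewritten as a LIKE prefix match.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("abcde",), ("abxyz",), ("zzzzz",)])
conn.execute("CREATE INDEX idx_name ON t (name)")

# Function-on-column form: the engine must compute substr() for every row.
substring_form = conn.execute(
    "SELECT name FROM t WHERE substr(name, 1, 2) = 'ab' ORDER BY name"
).fetchall()

# Prefix form: same result set, expressed as a sargable LIKE pattern.
like_form = conn.execute(
    "SELECT name FROM t WHERE name LIKE 'ab%' ORDER BY name"
).fetchall()

print(substring_form == like_form)  # prints True
```

The rewrite only works for anchored prefixes (substring starting at position 1); predicates on the middle of a value, or on length, still force per-row evaluation, which is where persisted computed columns come in.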
It's hard to give general answers to performance questions, but in your case I would try a MemSQL columnstore table instead of an in-memory rowstore table. Since you are doing full scans anyway, you'll get the benefit of having the column data laid out contiguously.
http://docs.memsql.com/4.0/concepts/columnar/
