Two SQL queries show a large difference in execution time in Presto

I deployed a Presto cluster with 2 worker nodes, but two SQL queries show a large difference in execution time.
-- sql1: takes 398.12ms
SELECT count(employee_name) from employee where jobstatus=2;
-- sql2: takes 16.58s
SELECT count(employee_name) from employee where create_time > date_parse('2018-12-20','%Y-%m-%d') and create_time < date_parse('2019-12-20','%Y-%m-%d');
My guess is that sql2 loads all of the employee table's data into memory for filtering, while sql1 is filtered directly in Oracle. How can I confirm this? Or is there another way to locate the cause?
The Presto version is 0.147. employee is an Oracle table with 500,000 rows, of which 36 match jobstatus=2, and 98 fall between date_parse('2018-12-20','%Y-%m-%d') and date_parse('2019-12-20','%Y-%m-%d'). Neither create_time nor jobstatus is indexed.
There was no concurrency during testing; the queries were executed sequentially.

If you are connecting to Oracle from Presto, then those queries will be single-threaded (a single JDBC connection) on the Presto side, which means only one worker will be active regardless of the cluster size. Hence, whatever performance numbers you are seeing are coming from the Oracle side.
Based on the given performance numbers, it seems the jobstatus column is indexed and create_time is not. Please verify that.
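One way to verify where the filtering happens is to look at the query plan. This is a minimal check, assuming the queries run against the same Oracle catalog as above; the exact plan layout varies by Presto version and connector:
-- Prefix the slow query with EXPLAIN and inspect the plan:
EXPLAIN SELECT count(employee_name) from employee where create_time > date_parse('2018-12-20','%Y-%m-%d') and create_time < date_parse('2019-12-20','%Y-%m-%d');
-- If the WHERE condition appears as a constraint inside the TableScan
-- node, it is being pushed down to Oracle; if it only shows up in a
-- Filter/ScanFilterProject node above the scan, Presto is pulling the
-- rows and filtering them itself.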

Related

Cassandra query using secondary index timed out

I am facing a timeout issue while executing a query on a Cassandra database. We have tried increasing the read timeout fields "read_request_timeout_in_ms" and "range_request_timeout_in_ms" in cassandra.yaml, but the query still times out after 10 seconds.
Is there any way we can increase the timeout value to 1-2 minutes?
Sample Product Table Schema:
- product_id string (primary key)
- product_name string
- created_on timestamp (secondary index)
- updated_on timestamp
Requirement: I want to query all the products which were created on a particular day, using the 'created_on' field.
Sample Query: select * from "Product" where created_on > 1632906232 AND created_on < 1632906232
Note: the query uses the secondary index field in the filter.
Environment details: Cassandra database with a 2-node cluster setup.
The underlying problem is that range queries are expensive, which is why the query takes so long to complete. By the way, it looks like you posted the wrong query, because you have the same value on both sides of the range.
The default timeouts are in place to prevent nodes from being overloaded by expensive queries and going down. Increasing the server-side timeouts is not the right approach, and in your case it's most likely the client-side timeout that is getting triggered.
You need to review your data model and instead create a table partitioned by the creation date so the query performs better, for example along the lines sketched below. Cheers!
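A minimal sketch of such a model, assuming one day's products fit comfortably in a single partition (the table and column names here are hypothetical):
-- Partition by creation day instead of relying on a secondary index;
-- product_id as clustering key keeps one row per product per day.
CREATE TABLE products_by_created_day (
created_day date,
product_id text,
product_name text,
created_on timestamp,
PRIMARY KEY ((created_day), product_id)
);
-- Fetching all products created on a given day now hits one partition:
SELECT * FROM products_by_created_day WHERE created_day = '2021-09-29';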

Is there a way to check the counts in Cassandra system tables? Where can we check the metadata of the latest inserts?

I am working on an Oracle-to-Cassandra migration tool, where I want to maintain a validation table with an Oracle count column and a Cassandra count column so that I can validate the migration job. In Cassandra, does the system keep the count of recently executed/inserted queries, or the total row count of a particular table? Is this stored anywhere in the Cassandra system tables? If so, where? If not, please suggest some way to design a validation framework for the data migration.
Is there a way in Cassandra to read the record count of the latest inserts and the total count of a table from any system table, instead of executing a count(*) query on the tables? Does Cassandra maintain these counts anywhere internally? If so, in which system tables can we check the metadata of the latest inserts?
Cassandra is a distributed system and there is no place where it collects per-table counts. You can get some estimates from system.size_estimates, but it will only tell you the partition count per token range, and the partition sizes.
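For example, the estimates can be read like this (the keyspace and table names are placeholders):
-- Rough per-token-range estimates only; not an exact row count.
SELECT range_start, range_end, partitions_count, mean_partition_size
FROM system.size_estimates
WHERE keyspace_name = 'my_keyspace' AND table_name = 'my_table';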
For the kind of framework you're asking about, you may need to develop custom Spark code (the easiest way) that counts the rows and performs other checks. Spark is highly optimized for efficient data access and may be preferable to writing the custom code yourself.
Also, during migration, consider using a consistency level greater than ONE to make sure that at least several nodes confirm each write. That said, it depends on the amount of data and the timing requirements of your migration jobs.
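If you run any of the checks through cqlsh, for instance, the session consistency can be raised before issuing statements; the drivers expose an equivalent setting:
-- cqlsh command; QUORUM requires a majority of replicas to respond.
CONSISTENCY QUORUM;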

Cassandra data modeling - Do I choose hotspots to make the query easier?

Is it ever okay to build a data model that makes the fetch query easier, even though it will likely create hotspots within the cluster?
While reading, please keep in mind that I am not working with Solr right now, and given the frequency with which this data will be accessed, I didn't think using spark-sql would be appropriate. I would like to keep this as pure Cassandra.
We have transactions, which are modeled using a UUID as the partition key so that the data is evenly distributed around the cluster. One of our access patterns requires that a UI get all records for a given user and date range, with a query like so:
select * from transactions_by_user_and_day where user_id = ? and created_date_time > ?;
The first model I built uses the user_id and created_date (day the transaction was created, always set to midnight) as the primary key:
CREATE TABLE transactions_by_user_and_day (
user_id int,
created_date timestamp,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id, created_date), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
This table seems to perform well. Using the created_date as part of the partition key allows users to be spread around the cluster more evenly, preventing hotspots. However, from an access perspective it makes the data access layer do a bit more work than we would like: it ends up having to build an IN statement with all the days in the provided range, instead of supplying a date and a greater-than operator:
select * from transactions_by_user_and_day where user_id = ? and created_date in (?, ?, …) and created_date_time > ?;
To simplify the work to be done at the data access layer, I have considered modeling the data like so:
CREATE TABLE transactions_by_user_and_day (
user_id int,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
With the above model, the data access layer can fetch the transaction_ids for the user and filter on a specific date range within Cassandra. However, this introduces a chance of hotspots within the cluster: users with longevity and/or high volume will create quite a few more columns in their row. We intend to supply a TTL on the data so that anything older than 60 days drops off.
Additionally, I've analyzed the size of the data, and 60 days' worth of data for our highest-volume user is under 2 MB. Doing the math: if we assume that all 40,000 users (this number won't grow significantly) are spread evenly over a 3-node cluster at 2 MB of data per user, you end up with a maximum of just over 26 GB per node ((13,333.33 * 2) / 1024). In reality, you aren't going to end up with 1/3 of your users doing that much volume, and you'd have to get really unlucky for Cassandra, using vnodes, to put all of those users on a single node. From a resources perspective, I don't think 26 GB is going to make or break anything either.
Thanks for your thoughts.
Data Model 1: Something else you could do would be to change your data access layer to issue a query for each day individually, instead of using the IN clause (see the sketch after the link below). Check out this page to understand why that would be better:
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
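Concretely, instead of one IN statement spanning many partitions, the data access layer would issue one single-partition query per day in the range (and could execute them asynchronously), for example:
-- One query per day in the requested range; each hits a single partition.
SELECT * FROM transactions_by_user_and_day WHERE user_id = ? AND created_date = '2016-01-01' AND created_date_time > ?;
SELECT * FROM transactions_by_user_and_day WHERE user_id = ? AND created_date = '2016-01-02' AND created_date_time > ?;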
Data Model 2: 26 GB of data per node doesn't seem like much, but a 2 MB fetch seems a bit large. Of course, if this is an outlier, then I don't see a problem with it. You might try setting up a cassandra-stress job to test the model. As long as the majority of your partitions are smaller than 2 MB, that should be fine.
One other solution would be to use Data Model 2 with bucketing. This would give you more overhead on writes, though, as you'd have to maintain a bucket lookup table as well. Let me know if you need me to elaborate more on this approach.
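As a rough illustration of one bucketing variant, using a small fixed set of buckets derived at write time (e.g. from a hash of the transaction_id) rather than the lookup table mentioned above; all names and the bucket count here are assumptions:
-- Hypothetical bucketed form of Data Model 2: the partition key pairs
-- the user with a bucket number so one user's data spans several partitions.
CREATE TABLE transactions_by_user_bucket (
user_id int,
bucket int,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id, bucket), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
-- Reads fan out over the fixed buckets (or issue one query per bucket,
-- per the multi-partition IN caveat above):
SELECT * FROM transactions_by_user_bucket WHERE user_id = ? AND bucket IN (0, 1, 2, 3) AND created_date_time > ?;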

Select All Records From Cassandra

I am trying to select all records from one Cassandra table (~10M records), which should be distributed over 4 nodes, using the CQL shell, but every time I do that it pages the output at 1K records max. So my question is: is it possible to select all records at once, as I am trying to see how much time it takes Cassandra to retrieve all the records?
When you write "SELECT * FROM cf", the CQL client will never select everything at once; that would be an unwise action for large data. Instead it will load only the first page and give you an iterator. Cassandra supports automatic query paging from version 2.0, so you should issue your select-all query and iterate over the pages to load the full column family. See the example for the Python client. There is currently no way to load everything in one action in CQL, and there shouldn't be.
While it was already pointed out that it's a bad idea to try and load all data in cqlsh, what you're trying to do is still somewhat possible. You just need to set a limit and probably increase the timeout for cqlsh.
user@host:~# cqlsh --request-timeout=600
This will start the shell with a request timeout of 10 minutes.
select * from some_table limit 10000000;
Please do not use this in a production environment, as it might have terrible implications for performance and cluster availability!

Need help creating a Cassandra data model design for my requirement

I have a Job_Status table with the following columns:
Job_ID (numeric)
Job_Time (datetime)
Machine_ID (numeric)
A few other fields containing stats (like memory and CPU utilization)
At a regular interval (say 1 minute), entries are inserted into the above table for the jobs running on each machine.
I want to design the data model in Cassandra.
My requirement is to get the list of pairs of jobs which are running at the same time on 2 or more machines.
I have created a table with Job_ID and Job_Time as the row's primary key, but in order to achieve the desired result I have to do a lot of parsing of the data after retrieving the records, which takes a lot of time once the number of records reaches around 500 thousand.
This requirement calls for an operation like SQL's inner join, but I can't use SQL due to some business reasons; besides, a SQL query over such a huge data set also takes a lot of time, as I found when I tried it with dummy data in SQL Server.
So I need your help with the points below:
Kindly suggest an efficient data model in Cassandra for this requirement.
How can the join operation of SQL be achieved/implemented in a Cassandra database?
Kindly suggest an alternate design/algorithm. I have been stuck on this problem for a very long time.
That's a pretty broad question. As a general approach, you might want to look at pairing Cassandra with Spark so that you can do the large join in parallel.
You would insert jobs into your table when they start and delete them when they complete (possibly with a TTL set on insert so that jobs that don't get deleted will auto delete after some time).
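For example, the insert at job start could carry the TTL directly (the table layout and the 7-day TTL are placeholders):
-- Hypothetical running-jobs table; the row expires automatically
-- 7 days (604800 s) after insert if the job never deletes it.
INSERT INTO running_jobs (job_id, machine_id, job_time)
VALUES (42, 7, '2016-01-01 00:00:00')
USING TTL 604800;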
When you want to update your pairing of jobs, you'd run a Spark batch job that loads the table data into an RDD and then does a map/reduce operation on the data, or uses Spark SQL to do a SQL-style join. You'd probably then write the resulting RDD back to a Cassandra table.
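The SQL-style join Spark would run could look roughly like this, assuming the Cassandra table is exposed to Spark SQL as job_status and that "running at the same time" means being sampled with the same job_time:
-- Self-join: pairs of distinct jobs recorded at the same sampling time
-- on different machines; a.job_id < b.job_id keeps each pair once.
SELECT a.job_id AS job_a, b.job_id AS job_b, a.job_time
FROM job_status a
JOIN job_status b ON a.job_time = b.job_time
WHERE a.job_id < b.job_id AND a.machine_id <> b.machine_id;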
