Retrieving bucketing value in WITH statement for subsequent SELECT - Presto

I have several tables with bucketing applied. It works great when I specify the bucket/partition value up front in my SELECT query. However, when I retrieve the bucket value from a different table, within a WITH statement, Hive/Athena no longer seems to use the optimisation and scans the entire table instead. I would like to learn whether there is a way to write my query so that the optimisation is maintained.
For a simple example, I have two tables:
Table1

category | categoryid
---------+-----------
mass     | 1

Table2

categoryid | index | value
-----------+-------+------
1          | 0     | 15
1          | 1     | 10
1          | 2     | 7
The bucketed/clustered column is categoryid. I have a single category ('mass') and would like to retrieve the values that correspond to it, so I have designed my SELECT like this:
WITH dataset AS (
    SELECT categoryid
    FROM Table1
    WHERE category = 'mass'
)
SELECT index, value
FROM Table2, dataset
WHERE Table2.categoryid = dataset.categoryid
This runs, but it seems to scan the entire table, because Hive doesn't know the categoryid to use for bucket pruning before starting the scan. If I swap the final Table2.categoryid = dataset.categoryid for Table2.categoryid = 1, then it scans only a fraction of the data.
So is there some way of writing this query to ensure Hive doesn't search more buckets in the second table than it has to?

Athena is based on Presto. Unless there is some modification in Athena in this area (and I think there currently isn't), this cannot be made to work in a single query.
Recommended workaround: issue one query to gather the dataset.categoryid values, then pass them as constants to your main query:
WITH dataset AS (
    SELECT categoryid
    FROM Table1
    WHERE category = 'mass'
)
SELECT index, value
FROM Table2, dataset
WHERE Table2.categoryid = dataset.categoryid
  AND Table2.categoryid IN ( <all possible values> );
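In practice, that means running the lookup as its own query first and splicing the results into the main query as literals. A minimal sketch, assuming the lookup returns a single categoryid of 1:
-- Step 1: run separately and collect the result in your application.
SELECT categoryid FROM Table1 WHERE category = 'mass';
-- Step 2: pass the collected value as a constant, so the engine can
-- prune buckets at planning time.
SELECT index, value
FROM Table2
WHERE categoryid IN (1);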
This is expected to improve with the addition of Dynamic Filtering in Presto, which the Presto community is currently working on.

Related

Time series with Delta time travel in Databricks

I'm storing the prices of products in a Delta table. The schema of the table is like this:
id | price | updated
---+-------+-----------
1  | 3     | 2022-03-21
2  | 4     | 2022-03-20
3  | 3     | 2022-03-20
I upsert rows using the id field as the primary key, updating the price and updated fields.
I'm trying to get the series of prices over time using Databricks time travel. But looking at the documentation, it appears I can only compare two versions of a table, like this:
%sql
SELECT count(distinct id) - (
    SELECT count(distinct id)
    FROM table TIMESTAMP AS OF date_sub(current_date(), 7))
FROM table
Is there a way to select the distinct prices across all versions?
I would really not recommend using time travel for that, for the following reasons:
If your data is updated frequently, you will accumulate a lot of versions, and your performance will degrade over time, as handling a huge number of versions (tens of thousands) puts a lot of pressure on the driver.
It's very hard to do historical analysis, as you can already see: for each version you would need a subquery, and then a union of the results.
Instead, you can use two tables: the first with the actual data, and the second with the historical data, ideally built as SCD Type 2 (Slowly Changing Dimension) with markers for which period each price was active. You can build that second table using the Change Data Feed (CDF) functionality to pull changes from the first table and apply them to the second with a MERGE operation. The Databricks documentation includes an example of using MERGE to build SCD Type 2 (although without CDF).
With this approach it will be easy for you to perform historical analysis, as all the data will be in the same table and you won't need time travel.
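A minimal sketch of that MERGE, following the SCD Type 2 pattern from the Databricks docs; the history table (prices_history), its columns (valid_from, valid_to, is_current), and the updates view (built from the source table's CDF) are all illustrative assumptions:
-- Close the current version of a changed price and insert the new one.
-- Rows with mergeKey = NULL can never match, which forces an INSERT
-- of the new version of a changed row.
MERGE INTO prices_history AS h
USING (
    SELECT u.id AS mergeKey, u.* FROM updates u
    UNION ALL
    SELECT NULL AS mergeKey, u.*
    FROM updates u
    JOIN prices_history ph
      ON u.id = ph.id AND ph.is_current = true AND u.price <> ph.price
) s
ON h.id = s.mergeKey AND h.is_current = true
WHEN MATCHED AND h.price <> s.price THEN
    UPDATE SET is_current = false, valid_to = s.updated
WHEN NOT MATCHED THEN
    INSERT (id, price, valid_from, valid_to, is_current)
    VALUES (s.id, s.price, s.updated, NULL, true);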

Cassandra DB Query for System Date

I have a table customer_info in a Cassandra DB, and it contains a column billing_due_date, which is a date field (dd-MMM-yy, e.g. 17-AUG-21). I need to fetch certain fields from the customer_info table based on billing_due_date, where billing_due_date should be equal to the system date + 1.
Can anyone suggest a Cassandra DB query for this?
fetch certain fields from the customer_info table based on billing_due_date
transaction_id is the primary key; it is just generated through uuid()
Unfortunately, there really isn't going to be a good way to do this. Right now, the data in the customer_info table is distributed across all nodes in the cluster based on a hash of the transaction_id. Essentially, any query based on something other than transaction_id is going to read from multiple nodes, which is a query anti-pattern in Cassandra.
In Cassandra, you need to design your tables based on the queries that they need to support. For example, choosing transaction_id as the sole primary key may distribute well, but it doesn't offer much in the way of query flexibility.
Therefore, the best way to serve this query is to create a query table containing the data from customer_info, with a key definition of PRIMARY KEY (billing_due_date, transaction_id). Then, a query like this should work:
SELECT * FROM customer_info_by_date
WHERE billing_due_date = toDate(now()) + 2d;

 billing_due_date | transaction_id                       | name
------------------+--------------------------------------+---------
       2021-08-20 | 2fe82360-e314-4d5b-aa33-5deee9f03811 | Rinzler
       2021-08-20 | 92cb9ee5-dee6-47fe-b372-0829f2e384cd | Clu

(2 rows)
Note that for this example I am using the system date plus two days. In your case, you'll want to adjust the duration from 2d down to 1d. Cassandra 4.0 allows date arithmetic, so this will work fine if you are on that version. If you are not, you'll have to do the "system date plus one" calculation on the app side.
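For completeness, the query table itself might be defined like this (a sketch; the column set follows the sample output above, and the types are assumed):
CREATE TABLE customer_info_by_date (
    billing_due_date date,
    transaction_id uuid,
    name text,
    PRIMARY KEY (billing_due_date, transaction_id)
);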
Another way to go about this would be to create a secondary index on billing_due_date, but I don't recommend that path, as it will query multiple nodes to build the result set.
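For reference, that (not recommended) index would be created like this:
CREATE INDEX ON customer_info (billing_due_date);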

Cassandra returns Unordered result set for numeric values

I am new to NoSQL and just started learning Cassandra, and I have the following question. I created a simple table with one column to understand Cassandra partitioning and clustering, and I am trying to query all the values after insertion.
My table structure
create table if not exists music_library(custno int, primary key(custno))
I inserted the following values in sequential order:
insert into music_library(custno) values (11)
insert into music_library(custno) values (12)
insert into music_library(custno) values (13)
insert into music_library(custno) values (14)
Then I queried the table:
select * from music_library
It returned values in the following order:
13
11
14
12
but I was expecting:
11
12
13
14
Why is it behaving like that?
I ran your exact statements and reproduced the same result. But I also adjusted your query to include the token function, and this is what it produced:
aaron@cqlsh:stackoverflow> select custno, token(custno) from music_library;

 custno | system.token(custno)
--------+----------------------
     13 | -5034495173465742853
     11 | -4156302194539278891
     14 |  4279681877540623768
     12 |  8582886034424406875

(4 rows)
Why is it behaving like that?
Simply put, because Cassandra cannot order results by the values of the partition keys.
As your table has a single primary key of custno, your rows are partitioned by the hashed token value of custno and written to the nodes responsible for those token ranges. When you run an unbound query in Cassandra (a query without a WHERE clause), the results are returned ordered by the hashed token values of their partition keys.
Using ORDER BY won't work here, either. ORDER BY can only sort data within a partition, and even then only on clustering keys. To get the custno values to order properly, you will need to find a new partition key, and then specify custno as a clustering key in an ascending direction.
Edit 20190916 - follow-up clarifications
Will this tokenization happen for all the columns?
No. The partition keys are hashed into a token to determine their placement in the cluster (which node(s) they are written to). Individual column values are written within a partition.
How can I return the inserted numbers in order?
You cannot alter the order of this table without changing the model. Simply put, you'll have to find a way to organize the values you expect your query to return together (that is, find another partition key). Exactly how that looks depends on your business/query requirements.
For example, let's say that I wanted to track which customers purchased specific music albums. I might create a table that looks like this:
CREATE TABLE customers_by_album (
    album TEXT,
    band TEXT,
    custno INT,
    PRIMARY KEY (album, custno))
WITH CLUSTERING ORDER BY (custno ASC);
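For example, the rows behind the result set below could be inserted like this:
INSERT INTO customers_by_album (album, band, custno) VALUES ('Moving Pictures', 'Rush', 11);
INSERT INTO customers_by_album (album, band, custno) VALUES ('Moving Pictures', 'Rush', 12);
INSERT INTO customers_by_album (album, band, custno) VALUES ('Moving Pictures', 'Rush', 13);
INSERT INTO customers_by_album (album, band, custno) VALUES ('Moving Pictures', 'Rush', 14);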
After inserting some data, the following query returns results ordered by custno:
aaron@cqlsh:stackoverflow> SELECT album, token(album), band, custno
                           FROM customers_by_album WHERE album='Moving Pictures';

 album           | system.token(album) | band | custno
-----------------+---------------------+------+--------
 Moving Pictures | 7819329704333693835 | Rush |     11
 Moving Pictures | 7819329704333693835 | Rush |     12
 Moving Pictures | 7819329704333693835 | Rush |     13
 Moving Pictures | 7819329704333693835 | Rush |     14

(4 rows)
This works, because I am querying data by a partition (album), and then I am "clustering" on custno which leverages the on-disk sort order. This is also the order the data was written to disk in, so Cassandra just reads it from the partition sequentially.
I wrote an article on this topic for DataStax a few years ago, and it's still quite relevant. Give it a read if you get a chance: https://www.datastax.com/dev/blog/we-shall-have-order

How can a CQL query match alternative values in a collection?

I have a question about querying a Cassandra collection. I want to write a query that searches within a collection.
CREATE TABLE rd_db.test1 (
    testcol3 frozen<set<text>> PRIMARY KEY,
    testcol1 text,
    testcol2 int
)
That is the table structure, and this is the table contents.
In this situation, I want to write a CQL query that matches alternative values in the set column. If this were SQL and testcol3 weren't a collection, it would be:
select * from rd_db.test1 where testcol3 = 4 or testcol3 = 5
But it is CQL and a collection, so I tried:
select * from test1 where testcol3 contains '4' OR testcol3 contains '5' ALLOW FILTERING ;
select * from test1 where testcol3 IN ('4','5') ALLOW FILTERING ;
but these two queries didn't work. Please help!
This won't work for you, for multiple reasons:
there is no OR operation in CQL
you can only do a full match on the value of the partition key (testcol3)
although you may create secondary indexes for fields with collection types, it's impossible to create an index on the values of a partition key
You need to change the data model, and you need to know the queries that you're going to execute in advance. From a brief look at your data model, I would suggest rolling the set field out into multiple rows, with individual values becoming individual partitions, as sketched below.
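A sketch of what that rollout could look like; the table name (test1_by_value), the testvalue column, and the choice of clustering column are all assumptions:
-- One row per set element, so each value is addressable as a partition key.
CREATE TABLE rd_db.test1_by_value (
    testvalue text,    -- one element of the original set
    testcol1 text,
    testcol2 int,
    PRIMARY KEY (testvalue, testcol1)  -- clustering column keeps rows sharing a value distinct
);

-- IN on the partition key then covers the "4 or 5" case:
SELECT * FROM rd_db.test1_by_value WHERE testvalue IN ('4', '5');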
But I also want to suggest taking the DS201 & DS220 courses on the DataStax Academy site for a better understanding of how Cassandra works and how to model data for it.

Get Date Range for Cassandra - Select timeuuid with IN returning 0 rows

I'm trying to get data from a date range on Cassandra, the table is like this:
CREATE TABLE test6 (
    time timeuuid,
    id text,
    checked boolean,
    email text,
    name text,
    PRIMARY KEY ((time), id)
)
But when I select a date range I get nothing:
SELECT * FROM test6 WHERE time IN ( minTimeuuid('2013-01-01 00:05+0000'), now() );
(0 rows)
How can I get a date range from a Cassandra Query?
The IN condition is used to specify multiple keys for a SELECT query. You're close, but to run a date range query on your table you'll want to use greater-than and less-than.
Of course, you can't run a greater-than/less-than query on a partition key, so you'll need to flip your keys for this to work. This also means that you'll need to specify your id in the WHERE clause, as well:
CREATE TABLE teste6 (
    time timeuuid,
    id text,
    checked boolean,
    email text,
    name text,
    PRIMARY KEY ((id), time)
)

INSERT INTO teste6 (time,id,checked,email,name)
VALUES (now(),'B26354',true,'rdeckard@lapd.gov','Rick Deckard');

SELECT * FROM teste6
WHERE id='B26354'
  AND time >= minTimeuuid('2013-01-01 00:05+0000')
  AND time <= now();
 id     | time                                 | checked | email             | name
--------+--------------------------------------+---------+-------------------+--------------
 B26354 | bf0711f0-b87a-11e4-9dbe-21b264d4c94d |    True | rdeckard@lapd.gov | Rick Deckard

(1 rows)
Now while this will technically work, partitioning your data by id might not work for your application, so you may need to put some more thought into your data model and come up with a better partition key.
Edit:
Remember with Cassandra, the idea is to get a handle on what kind of queries you need to be able to fulfill. Then build your data model around that. Your original table structure might work well for a relational database, but in Cassandra that type of model actually makes it difficult to query your data in the way that you're asking.
Take a look at the modifications that I have made to your table (basically, I just reversed your partition and clustering keys). If you still need help, Patrick McFadin (DataStax's Chief Evangelist) wrote a really good article called Getting Started with Time Series Data Modeling. He has three examples that are similar to yours. In fact his first one is very similar to what I have suggested for you here.
