Snowflake - Getting Invalid argument types for function '=': error - subquery

I am trying to run a subquery to find the physician_id of each physician who has performed a 'Procedure' for a patient.
Below are the tables referenced in the query:
#event category table
EVENT_NAME CATEGORY
Chemotherapy Procedure
Radiation Procedure
Immunosuppressants Prescription
BTKI Prescription
Biopsy Test
#patient_treatment table
PATIENT_ID EVENT_NAME PHYSICIAN_ID
1 Radiation 1000
2 Chemotherapy 2000
1 Biopsy 1000
3 Immunosuppressants 2000
4 BTKI 3000
5 Radiation 4000
4 Chemotherapy 2000
1 Biopsy 5000
6 Chemotherapy 6000
#physician_speciality table
PHYSICIAN_ID SPECIALITY
1000 Radiologist
2000 Oncologist
3000 Hermatologist
4000 Oncologist
5000 Pathologist
6000 Oncologist
#query tried to find the physicians who have done a 'Procedure'
select ph.physician_id from physician_speciality ph where ph.physician_id in (select pt.event_name, pt.physician_id from patient_treatment pt where pt.event_name in (select ec.event_name, ec.category from event_category ec where ec.event_name = pt.event_name and ec.category='Procedure'));
But I am getting the error below:
SQL compilation error
Invalid argument types for function '=': (VARCHAR(50), ROW(VARCHAR(50), VARCHAR(100)))

Can you try this one?
select ph.physician_id from physician_speciality ph where ph.physician_id in (select pt.physician_id
from patient_treatment pt where pt.event_name in
(select ec.event_name from event_category ec where ec.event_name = pt.event_name and ec.category='Procedure'));
You tried to compare one field to two fields:
ph.physician_id in (select pt.event_name, pt.physician_id
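As a quick sanity check, the corrected query can be run against the sample rows from the question. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for Snowflake (an assumption, the two engines differ in many details), and the ORDER BY is added just to make the printed result deterministic.

```python
import sqlite3

# In-memory SQLite database standing in for Snowflake (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event_category (event_name TEXT, category TEXT);
CREATE TABLE patient_treatment (patient_id INT, event_name TEXT, physician_id INT);
CREATE TABLE physician_speciality (physician_id INT, speciality TEXT);
INSERT INTO event_category VALUES
  ('Chemotherapy', 'Procedure'), ('Radiation', 'Procedure'),
  ('Immunosuppressants', 'Prescription'), ('BTKI', 'Prescription'),
  ('Biopsy', 'Test');
INSERT INTO patient_treatment VALUES
  (1, 'Radiation', 1000), (2, 'Chemotherapy', 2000), (1, 'Biopsy', 1000),
  (3, 'Immunosuppressants', 2000), (4, 'BTKI', 3000), (5, 'Radiation', 4000),
  (4, 'Chemotherapy', 2000), (1, 'Biopsy', 5000), (6, 'Chemotherapy', 6000);
INSERT INTO physician_speciality VALUES
  (1000, 'Radiologist'), (2000, 'Oncologist'), (3000, 'Hermatologist'),
  (4000, 'Oncologist'), (5000, 'Pathologist'), (6000, 'Oncologist');
""")

# Each subquery returns exactly one column, matching the single column on the
# left-hand side of its IN.
query = """
SELECT ph.physician_id
FROM physician_speciality ph
WHERE ph.physician_id IN (
    SELECT pt.physician_id
    FROM patient_treatment pt
    WHERE pt.event_name IN (
        SELECT ec.event_name
        FROM event_category ec
        WHERE ec.event_name = pt.event_name
          AND ec.category = 'Procedure'))
ORDER BY ph.physician_id
"""
physician_ids = [row[0] for row in conn.execute(query)]
print(physician_ids)  # [1000, 2000, 4000, 6000]
```

Physicians 3000 and 5000 are excluded because their only events (BTKI, Biopsy) are not in the 'Procedure' category.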


How can I filter for a specific date on a CQL timestamp column?

I have a table defined as:
CREATE TABLE downtime(
asset_code text,
down_start timestamp,
down_end timestamp,
down_duration duration,
down_type text,
down_reason text,
PRIMARY KEY ((asset_code, down_start), down_end)
);
I'd like to get downtime on a particular day, such as:
SELECT * FROM downtime \
WHERE asset_code = 'CA-PU-03-LB' \
AND todate(down_start) = '2022-12-11';
I got a syntax error:
SyntaxException: line 1:66 no viable alternative at input '(' (...where asset_code = 'CA-PU-03-LB' and [todate](...)
If a function is not allowed on a partition key column in the WHERE clause, how can I get the rows whose "down_start" falls on a particular day?
You don't need to use the TODATE() function to filter for a specific date. You can simply specify the date as '2022-12-11' when applying a filter on a CQL timestamp column.
The catch is that you cannot use the equality operator (=) with a bare date, because the CQL timestamp data type is encoded as the number of milliseconds since the Unix epoch (Jan 1, 1970 00:00 GMT), so an equality filter only matches that exact millisecond. You need to be precise when you're working with timestamps.
Let me illustrate using this example table:
CREATE TABLE tstamps (
id int,
tstamp timestamp,
colour text,
PRIMARY KEY (id, tstamp)
)
My table contains the following sample data:
cqlsh> SELECT * FROM tstamps ;
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-05 11:25:01.000000+0000 | red
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
1 | 2022-12-07 01:48:07.870000+0000 | blue
1 | 2022-12-07 03:13:27.313000+0000 | indigo
The cqlsh client formats the tstamp column into a human-readable date in UTC. But really, the tstamp values are stored as integers:
cqlsh> SELECT tstamp, TOUNIXTIMESTAMP(tstamp) FROM tstamps ;
tstamp | system.tounixtimestamp(tstamp)
---------------------------------+--------------------------------
2022-12-05 11:25:01.000000+0000 | 1670239501000
2022-12-06 02:45:04.564000+0000 | 1670294704564
2022-12-06 11:06:48.119000+0000 | 1670324808119
2022-12-06 19:02:52.192000+0000 | 1670353372192
2022-12-07 01:48:07.870000+0000 | 1670377687870
2022-12-07 03:13:27.313000+0000 | 1670382807313
To retrieve the rows for a specific date, you need to specify the range of timestamps that fall on that date. For example, the timestamps for 6 Dec 2022 UTC range from 1670284800000 (2022-12-06 00:00:00.000 UTC) to 1670371199999 (2022-12-06 23:59:59.999 UTC).
This means if we want to query for December 6, we need to filter using a range query:
SELECT * FROM tstamps \
WHERE id = 1 \
AND tstamp >= '2022-12-06' \
AND tstamp < '2022-12-07';
and we get:
id | tstamp | colour
----+---------------------------------+--------
1 | 2022-12-06 02:45:04.564000+0000 | yellow
1 | 2022-12-06 11:06:48.119000+0000 | orange
1 | 2022-12-06 19:02:52.192000+0000 | green
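The epoch-millisecond bounds quoted above can be double-checked with a few lines of Python. This is a sketch for verification only; the helper name day_bounds_ms is made up for illustration, and the sample values are the ones returned by TOUNIXTIMESTAMP() earlier.

```python
from datetime import datetime, timedelta, timezone

def day_bounds_ms(year, month, day):
    """Return (start, end) in epoch milliseconds for a UTC calendar day.

    start is inclusive and end is exclusive, mirroring the
    `tstamp >= '2022-12-06' AND tstamp < '2022-12-07'` filter.
    """
    start = datetime(year, month, day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)

start_ms, end_ms = day_bounds_ms(2022, 12, 6)
print(start_ms, end_ms - 1)  # 1670284800000 1670371199999

# Values from the TOUNIXTIMESTAMP() output in the sample table.
sample = [
    1670239501000, 1670294704564, 1670324808119,
    1670353372192, 1670377687870, 1670382807313,
]
matches = [t for t in sample if start_ms <= t < end_ms]
print(matches)  # [1670294704564, 1670324808119, 1670353372192]
```

The three matches correspond to the yellow, orange and green rows.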
WARNING - In your case, where the timestamp column is part of the partition key, performing a range query is dangerous because it results in a multi-partition query: there are 86.4 million possible millisecond values between 1670284800000 and 1670371199999. For this reason, timestamps are not a good choice for partition keys. Cheers!

Physical Plan and Optimizing a Non-Equi Join in Spark SQL

I am using Spark SQL 2.4.0. I have a couple of tables as below:
CUST table:
id | name | age | join_dt
-------------------------
12 | John | 25 | 2019-01-05
34 | Pete | 29 | 2019-06-25
56 | Mike | 35 | 2020-01-31
78 | Alan | 30 | 2020-02-25
REF table:
eff_dt
------
2020-01-31
The requirement is to select all the records from CUST whose join_dt is <= eff_dt in the REF table. So, for this simple requirement, I put together the following query:
version#1:
select
c.id,
c.name,
c.age,
c.join_dt
from cust c
inner join ref r
on c.join_dt <= r.eff_dt;
Now, this creates a BroadcastNestedLoopJoin (BNLJ) in the physical plan, and so the query takes a long time to run.
Question 1:
Is there a better way to implement the same logic without inducing a BNLJ, so that the query executes faster? Is it possible to avoid the BNLJ?
Part 2:
Next, I split the query into two parts:
version#2:
select c.id, c.name, c.age, c.join_dt
from cust c
inner join ref r
on c.join_dt = r.eff_dt --equi join
union all
select c.id, c.name, c.age, c.join_dt
from cust c
inner join ref r
on c.join_dt < r.eff_dt; --theta join
Now, for the Query in Version#1, the physical plan shows that the CUST table is scanned only once, whereas the physical plan for the Query in Version#2 indicates that the same input table CUST is scanned twice (Once for each of the 2 queries combined with a union). However, I am surprised to find that Version#2 executes faster than version#1.
Question 2:
Why does version#2 execute faster than version#1, even though version#2 scans the table twice rather than once, and both versions induce a BNLJ?
Can anyone please clarify? Please let me know if additional information is required.
Thanks.

Cassandra how to add values in a single row on every hit

The application feeds this table with the data below, and the counts are incremented whenever we receive status updates. Initially the table looks like this:
+---------------+---------------+---------------+---------------+
| ID | Total count | Failed count | Success count |
+---------------+---------------+---------------+---------------+
| 1 | 30 | 10 | 20 |
+---------------+---------------+---------------+---------------+
Let's assume a total of 30 messages have been pushed, of which 10 failed and 20 succeeded, as shown above. Now the application runs again and the values change: 20 new records come in, all of which are successes. This should be updated in the same row:
+---------------+---------------+---------------+---------------+
| ID | Total count | Failed count | Success count |
+---------------+---------------+---------------+---------------+
| 1 | 50 | 10 | 40 |
+---------------+---------------+---------------+---------------+
Is this feasible in Cassandra using the counter data type?
Of course, you can use counter tables in your case.
Let's assume a table structure like this:
CREATE KEYSPACE Test WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
CREATE TABLE data (
id int,
data text,
PRIMARY KEY (id)
);
CREATE TABLE counters (
id int,
total_count counter,
failed_count counter,
success_count counter,
PRIMARY KEY (id)
);
You can increment counters by running queries like this:
UPDATE counters
SET total_count = total_count + 1,
success_count = success_count + 1
WHERE id= 1;
Hope this can help you.
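Counters can be incremented by any integer amount, not only by 1, so each application run can be applied as a single UPDATE. Below is a minimal sketch that builds such statements for the two runs described in the question. The helper build_counter_update is hypothetical and only formats CQL text; in a real application you would more likely bind the values through a driver's prepared statements.

```python
def build_counter_update(row_id, total=0, failed=0, success=0):
    """Format a CQL counter UPDATE for one batch of deltas (hypothetical helper)."""
    deltas = {
        "total_count": int(total),
        "failed_count": int(failed),
        "success_count": int(success),
    }
    # Skip counters whose delta is zero so the statement only touches what changed.
    assignments = [f"{col} = {col} + {n}" for col, n in deltas.items() if n]
    if not assignments:
        raise ValueError("at least one counter must change")
    return f"UPDATE counters SET {', '.join(assignments)} WHERE id = {int(row_id)};"

# The two runs from the question: (total, failed, success).
runs = [(30, 10, 20), (20, 0, 20)]
statements = [build_counter_update(1, *run) for run in runs]
for stmt in statements:
    print(stmt)

# Local tally of what the counter row should hold after both updates.
totals = [sum(column) for column in zip(*runs)]
print(totals)  # [50, 10, 40]
```

The second statement leaves failed_count untouched, which matches the expected row of 50 / 10 / 40.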

Excel : Get the most frequent value for each group

I have a table (Excel) with two columns (Time 'hh:mm:ss', Value) and I want to get the most frequent value for each group of rows.
For example, I have:
Time | Value
4:35:49 | 122
4:35:49 | 122
4:35:50 | 121
4:35:50 | 121
4:35:50 | 111
4:35:51 | 122
4:35:51 | 111
4:35:51 | 111
4:35:51 | 132
4:35:51 | 132
And I want to get the most frequent value for each Time:
Time | Value
4:35:49 | 122
4:35:50 | 121
4:35:51 | 132
Thanks in advance
UPDATE
The first answer, from @scott, using a helper column, is the correct one.
See the pic
You could use a helper column.
The helper column counts how often each Time/Value pair occurs, so in C2 I put the following and copied it down:
=COUNTIFS($A$2:$A$11,A2,$B$2:$B$11,B2)
Then, with the time to look up in E2, I put the following array formula in F2:
=INDEX($B$2:$B$11,MATCH(MAX(IF($A$2:$A$11=E2,IF($C$2:$C$11 = MAX(IF($A$2:$A$11=E2,$C$2:$C$11)),$B$2:$B$11))),$B$2:$B$11,0))
It is an array formula and must be confirmed with Ctrl-Shift-Enter. Then copied down.
I set it up like this:
Here is one way to do this in MS Access:
select tv.*
from (select time, value, count(*) as cnt
from t
group by time, value
) as tv
where exists (select 1
from (select top 1 time, value, count(*) as cnt
from t as t2
where t2.time = tv.time
group by time, value
order by count(*) desc, value desc
) as x
where x.time = tv.time and x.value = tv.value
);
MS Access doesn't support features such as window functions or CTEs that make this type of query easier in other databases.
Would this work? I haven't tried it; I took inspiration from another answer.
;WITH t3 AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY time ORDER BY c DESC, value DESC) AS rn
FROM (SELECT COUNT(*) AS c, time, value FROM t GROUP BY time, value) AS t2
)
SELECT *
FROM t3
WHERE rn = 1
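Since that query was posted untested, here is a sketch that runs it against the sample data from the question. It uses Python's built-in sqlite3 module as a stand-in engine (an assumption: window functions need SQLite 3.25 or newer, and other databases may differ in details). The leading semicolon is dropped and an ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (time TEXT, value INT)")
rows = [
    ("4:35:49", 122), ("4:35:49", 122),
    ("4:35:50", 121), ("4:35:50", 121), ("4:35:50", 111),
    ("4:35:51", 122), ("4:35:51", 111), ("4:35:51", 111),
    ("4:35:51", 132), ("4:35:51", 132),
]
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Count each (time, value) pair, then rank the pairs within each time by
# count (highest first), breaking ties by the larger value.
query = """
WITH t3 AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY time ORDER BY c DESC, value DESC) AS rn
    FROM (SELECT COUNT(*) AS c, time, value FROM t GROUP BY time, value) AS t2
)
SELECT time, value
FROM t3
WHERE rn = 1
ORDER BY time
"""
most_frequent = conn.execute(query).fetchall()
print(most_frequent)
# [('4:35:49', 122), ('4:35:50', 121), ('4:35:51', 132)]
```

For 4:35:51 the values 111 and 132 both occur twice; the `value DESC` tie-breaker picks 132, which matches the expected output in the question.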

Is there a way to make clustering order by data type and not string in Cassandra?

I created a table in CQL3 in the cqlsh using the following CQL:
CREATE TABLE test (
locationid int,
pulseid int,
name text, PRIMARY KEY(locationid, pulseid)
) WITH CLUSTERING ORDER BY (locationid ASC, pulseid DESC);
Note that locationid is an integer.
However, after I inserted data, and ran a select, I noticed that locationid's ascending sort seems to be based upon string, and not integer.
cqlsh:citypulse> select * from test;
locationid | pulseid | name
------------+---------+------
0 | 3 | test
0 | 2 | test
0 | 1 | test
0 | 0 | test
10 | 3 | test
5 | 3 | test
Note the 0 10 5. Is there a way to make it sort via its actual data type?
Thanks,
Allison
In Cassandra, the first part of the primary key is the 'partition key'. That key is used to distribute data around the cluster. It does this in a random fashion to achieve an even distribution. This means that you cannot order by the first part of your primary key.
What version of Cassandra are you on? In the most recent version of 1.2 (1.2.2), the create statement you have used as an example is invalid.
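To illustrate why partitions do not come back in numeric order, here is a toy sketch. It does not reproduce Cassandra's actual partitioner or token values: the MD5 hash of a 4-byte big-endian integer is used purely as a stand-in (an assumption for illustration), and toy_token is a made-up name.

```python
import hashlib

def toy_token(partition_key: int) -> int:
    """Map an int partition key to a pseudo-random token (illustrative stand-in)."""
    key_bytes = partition_key.to_bytes(4, "big", signed=True)
    return int.from_bytes(hashlib.md5(key_bytes).digest(), "big")

location_ids = [0, 5, 10]

# Rows are laid out by token, so a full scan walks partitions in token order,
# which generally has nothing to do with the numeric order of the keys.
by_token = sorted(location_ids, key=toy_token)
print(by_token)
```

Within a single partition, rows are still sorted by the clustering column (pulseid DESC in the question), which is why the four rows for locationid 0 appear as 3, 2, 1, 0.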
