Case insensitive GROUP BY in Presto

By default, Presto performs a case-sensitive GROUP BY, but I wanted to know how to do a case-insensitive GROUP BY. One method is to convert everything in the column to lower case and then group by it, i.e.
select * from ( select lower(name_of_the_column) as name_of_the_column, other_columns from table )
where conditions...
group by name_of_the_column
One way we can reduce time is by putting the conditions in the select statement inside the brackets. Is there any better method?

You don't need to push lower(...) into a subquery. If you simply write:
SELECT lower(name_of_the_column), ...
FROM ...
GROUP BY lower(name_of_the_column) -- or just "GROUP BY 1"
Presto will do the conversion to lowercase only once for each row (not twice).
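Putting it together, a complete query might look like this (a sketch against a hypothetical table events with a name column):
SELECT lower(name) AS name, count(*) AS total
FROM events
GROUP BY 1
Here GROUP BY 1 refers to the first item in the SELECT list, so the lowercasing expression is written once and evaluated only once per row.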

Related

Combine UNION and WITH statement on Azure Stream Analytics

I am trying to combine different sources with the UNION statement on Azure Stream Analytics.
In general, this works fine:
SELECT
date
, value
FROM source1
UNION
SELECT
date
, value
FROM source2
But now I need some calculations which require a WITH statement, so I hoped this would work:
SELECT
date
, value
FROM source1
UNION
(WITH tempTab AS (
SELECT
date
, value
FROM source2
)
SELECT
date
, value
FROM tempTab
)
(I'm aware that this example for the WITH statement is completely pointless, but let's assume I have a real-world scenario where it is necessary. Let's further assume the WITH statement works on its own, i.e. if I omit the lines from the first SELECT until after UNION.)
In this version I get a notification that there is a syntax error near the "WITH" statement. Is there a way to solve the syntax error and make WITH and UNION work together on Stream Analytics?
With the current ASA syntax/semantics, unlike T-SQL, the WITH clause is only allowed to appear first in the query.
You can only do "WITH step1 AS (...), step2 AS (...), ..." followed by SELECT clauses using any of step1, step2, ... as sources in the FROM clauses.
UNION can then be used either in the SELECT clauses after the WITH, or inside individual step definitions.
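Applied to the question's example, the rewrite would hoist the WITH clause to the very front (a sketch reusing the source1/source2 names from above):
WITH tempTab AS (
SELECT date, value FROM source2
)
SELECT date, value FROM source1
UNION
SELECT date, value FROM tempTab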

Select where multiple fields are not in subquery (excluding join)

I have a requirement to pull records that have no history in an archive table. Two fields of one record need to be checked for in the archive.
In a technical sense my requirement is a left join where the right side is NULL (a.k.a. an excluding join), which in ABAP OpenSQL is commonly implemented like this (for my scenario anyway):
Select * from xxxx  " xxxx is the result of a multi-table join
where xxxx~key not in (select key from archive_table where [conditions] )
and xxxx~foreign_key not in (select key from archive_table where [conditions] )
Those 2 fields are also checked against 2 more tables, so that would mean a total of 6 subqueries.
Database engines that I have worked with previously usually had some method to deal with such problems (such as an excluding join or OUTER APPLY).
For this particular case I will be trying to use ABAP logic with FOR ALL ENTRIES, but I would still like to know if it is possible to use the results of a sub-query to check more than 1 field or use another form of excluding-join logic on multiple fields using SQL (without involving the application server).
I have tested quite a few variations of sub-queries over the life-cycle of the program I was making. NOT EXISTS with a multiple-field check (shortened example below) to exclude based on 2 keys works in certain cases.
Performance is acceptable (processing time is about 5 seconds), although it's noticeably slower than the same query when excluding based on 1 field.
Select * from xxxx  " xxxx is the result of multiple inner joins and 1 left join ( 1-* relation )
where NOT EXISTS (
    select key from archive_table
    where key = xxxx~key OR key = xxxx~foreign_key
)
EDIT:
With changing requirements (more filtering) a lot has changed, so I figured I would update this. The construct I marked as xxxx in my example contained a single left join (where the main-to-secondary table relation is 1-*) and it appeared relatively fast.
This is where context becomes helpful for understanding the problem:
Initial requirement: pull all vendors without financial records in 3 tables.
Additional requirement: also exclude based on alternative payers (1-* relationship). This is what the example above is based on.
More requirements: also exclude based on alternative payees (*-* relationship between payer and payee).
The many-to-many join exponentially increased the record count within the construct I labeled xxxx, which in turn produced a lot of unnecessary work. For instance: a single customer with 3 payers and 3 payees produced 9 rows, with a total of 27 fields to check (3 per row), when in reality there are only 7 unique values.
At this point, moving the left-joined tables from the main query into sub-queries and splitting them gave significantly better performance than any smarter-looking alternatives.
select * from lfa1 inner join lfb1
where
(     lfa1~lifnr not in ( select lifnr from bsik
                          where bsik~lifnr = lfa1~lifnr )
  and lfa1~lifnr not in ( select wyt3~lifnr from wyt3
                          inner join t024e on wyt3~ekorg = t024e~ekorg and wyt3~lifnr <> wyt3~lifn2
                          inner join bsik on bsik~lifnr = wyt3~lifn2
                          where wyt3~lifnr = lfa1~lifnr and t024e~bukrs = lfb1~bukrs )
  and lfa1~lifnr not in ( select lfza~lifnr from lfza
                          inner join bsik on bsik~lifnr = lfza~empfk
                          where lfza~lifnr = lfa1~lifnr )
)
and [3 more sets of sub-queries like the 3 above, just checking different tables].
My Conclusion:
When exclusion is based on a single field, both NOT IN and NOT EXISTS work. One might be better than the other, depending on the filters you use.
When exclusion is based on 2 or more fields and you don't have a many-to-many join in the main query, NOT EXISTS ( select .. from table where id = a.id or id = b.id or ... ) appears to be the best.
The moment your exclusion criteria involve a many-to-many relationship within the main query, I would recommend looking for an optimal way to implement multiple sub-queries instead (even a sub-query for each key-table combination will perform better than a many-to-many join with 1 good-looking sub-query).
Anyway, any additional insight into this is welcome.
EDIT2: Although it's slightly off-topic, given how my question was about sub-queries, I figured I would post an update. After over a year I had to revisit the solution I worked on in order to expand it, and I learned that a proper excluding join works; I just failed horribly at implementing it the first time.
select headers~key
from headers left join items on headers~key = items~key
where items~key is null
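Extending that excluding-join pattern to the original two-field requirement would look roughly like this (a sketch reusing the placeholder names xxxx/archive_table from above; each checked field gets its own left join, and the row-multiplication caveat from the EDIT still applies):
select xxxx~key
from xxxx
left join archive_table as a1 on a1~key = xxxx~key
left join archive_table as a2 on a2~key = xxxx~foreign_key
where a1~key is null and a2~key is null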
if it is possible to use the results of a sub-query to check more than 1 field or use another form of excluding-join logic on multiple fields
No, it is not possible to check two columns in a subquery, as SAP Help clearly says:
The clauses in the subquery subquery_clauses must constitute a scalar subquery.
Scalar is the keyword here, i.e. it should return exactly one column.
Your subquery can have a multi-column key, and such syntax is completely legit:
SELECT planetype, seatsmax
FROM saplane AS plane
WHERE seatsmax < @wa-seatsmax AND
      seatsmax >= ALL ( SELECT seatsocc
                        FROM sflight
                        WHERE carrid = @wa-carrid AND
                              connid = @wa-connid )
However, you say that these two fields should be checked against different tables:
Those 2 fields are also checked against 2 more tables
so it's not the case for you. Your only choice seems to be a multi-join.
P.S. FOR ALL ENTRIES does not support negation logic; you cannot just use some sort of NOT IN FOR ALL ENTRIES. It won't be that easy.

Cassandra ALLOW FILTERING

I have a table as below
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),start,id)
);
I want to run this query
Select * from test where day=1 and start > 1475485412 and start < 1485785654
and action='accept' ALLOW FILTERING
Is this ALLOW FILTERING efficient?
I am expecting that Cassandra will filter in this order:
1. By the partitioning column (day)
2. By the range column (start) on 1's result
3. By the action column on 2's result
So ALLOW FILTERING should not be a bad choice for this query.
When there are multiple filtering parameters in the WHERE clause and the non-indexed column is the last one, how will the filtering work?
Please explain.
Is this ALLOW FILTERING efficient?
When you write "this" you mean in the context of your query and your model, however the efficiency of an ALLOW FILTERING query depends mostly on the data it has to filter. Unless you show some real data this is a hard to answer question.
I am expecting that Cassandra will filter in this order...
Yeah, this is what will happen. However, the inclusion of an ALLOW FILTERING clause in the query usually means a poor table design, that is, you're not following some guidelines of Cassandra modeling (specifically "one query <--> one table").
As a solution, I can suggest including the action field in the clustering key just before the start field, modifying your table definition:
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),action,start,id)
);
You would then rewrite your query without any ALLOW FILTERING clause:
SELECT * FROM test WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654
having only the minor issue that if one record "switches" action values you cannot perform an update on the single action field (because it's now part of the clustering key); you need to perform a delete with the old action value and an insert with the correct new value. But if you have Cassandra 3.0+, all this can be done with the help of the new Materialized View implementation, sketched below. Have a look at the documentation for further information.
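A minimal sketch of that materialized-view route, assuming Cassandra 3.0+ and a hypothetical view name test_by_action (the base table keeps its original key, so the action column stays updatable, and Cassandra maintains the view):
CREATE MATERIALIZED VIEW test_by_action AS
    SELECT * FROM test
    WHERE day IS NOT NULL AND action IS NOT NULL
      AND start IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY ((day), action, start, id);

SELECT * FROM test_by_action
WHERE day = 1 AND action = 'accept'
  AND start > 1475485412 AND start < 1485785654;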
In general ALLOW FILTERING is not efficient.
But in the end it depends on the size of the data you are fetching (for which Cassandra has to use ALLOW FILTERING) and the size of the data it is fetched from.
In your case Cassandra does not need filtering up to:
By the range column (start) on 1's result
as you mentioned. But after that, it will rely on filtering to search the data, which you are allowing in the query itself.
Now, keep the following in mind:
If your table contains, for example, 1 million rows and 95% of them have the requested value, the query will still be relatively efficient and you should use ALLOW FILTERING.
On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value, your query is extremely inefficient: Cassandra will load 999,998 rows for nothing. If the query is used often, it is probably better to add an index on the action column.
So check this first. If it works in your favour, use ALLOW FILTERING.
Otherwise, it would be wise to add a secondary index on action.
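For completeness, that secondary-index route is a one-liner in CQL (against the example table from the question):
CREATE INDEX ON test (action);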

Hive query string case

Is there a way to match all the case variants of a string while doing this:
select count(word) from table where word="abcd"
Actually when doing this, it is not the same as this:
select count(word) from table where word="ABCD"
Ignoring the case in a where clause is very simple. You can, for example, convert both sides of the comparison to all caps notation:
SELECT COUNT(word)
FROM table
WHERE UPPER(word)=UPPER('ABCD')
Regardless of the capitalization used for the search term, the UPPER function makes them match as desired.
select count(word) from table where lower(word)="abcd"
However, this assumes it's not a partitioned table. If the table is partitioned by word, you would end up doing a full table scan because of the lower(...).
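One possible workaround in the partitioned case (my suggestion, not part of the original answer) is to leave the partition column untouched and enumerate the case variants you expect, so partition pruning still works:
SELECT count(word) FROM table
WHERE word IN ('abcd', 'Abcd', 'ABCD')
This obviously only helps when the set of variants is known in advance.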
SELECT count(word) FROM table
WHERE word RLIKE "(?i)WOrd1|wOrd2"
The (?i) flag makes the regular expression case-insensitive, so this counts rows matching any capitalization of word1 or word2.

Firebird: combine UPPER with a primary key constraint

I want to use case insensitivity in several tables which came from another DB, where the fields and indexes can be case-insensitive.
This means that we can search for the needed row in any string format (DAta, Data, data, etc.) and find it by any of these keys.
I tried to use the UPPER function with an index, and to use this in a primary key to preserve the program logic.
But I failed with it: I didn't find any valid SQL statement to define it.
Maybe it's an impossible mission?
Or do you know a way to define a primary key with an "upper" index?
Thanks for any info!
If you want to do a case-insensitive search, you're supposed to use a case-insensitive collation. In case you always want to treat the field's value in a case-insensitive manner, you should define it at the field level, i.e.
CREATE TABLE T (
Foo VARCHAR(42) CHARACTER SET UTF8 COLLATE UNICODE_CI,
...
)
but you can also specify the collation in the search, like
SELECT * FROM T WHERE Foo = 'bar' COLLATE UNICODE_CI
Read more about available collations in the Firebird language reference.
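Tying this back to the primary-key part of the question: one way (an assumption building on the collation approach above, not spelled out in the original answer) is to declare the key column itself with the case-insensitive collation, so the index backing the primary key, and therefore its uniqueness check, becomes case-insensitive:
CREATE TABLE T (
    Foo VARCHAR(42) CHARACTER SET UTF8 NOT NULL COLLATE UNICODE_CI,
    CONSTRAINT PK_T PRIMARY KEY (Foo)
)
With this definition 'Data', 'data' and 'DATA' count as the same key value, so inserting more than one of them violates the primary key.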
IMHO a better way is to use an index by expression:
create index idx_upper on persons computed by (upper(some_name))
SQL queries like
select * from persons order by upper(some_name);
select * from persons where upper(some_name) starting with 'OBAM';
will use the index idx_upper.
