Select where multiple fields are not in subquery (excluding join) - subquery

I have a requirement to pull records, that do not have history in an archive table. 2 Fields of 1 record need to be checked for in the archive.
In technical sense my requirement is a left join where right side is 'null' (a.k.a. an excluding join), which in abap openSQL is commonly implemented like this (for my scenario anyways):
Select * from xxxx //xxxx is a result for a multiple table join
where xxxx~key not in (select key from archive_table where [conditions] )
and xxxx~foreign_key not in (select key from archive_table where [conditions] )
Those 2 fields are also checked against 2 more tables, so that would mean a total of 6 subqueries.
Database engines that I have worked with previously usually had some methods to deal with such problems (such as excluding join or outer apply).
For this particular case I will be trying to use ABAP logic with 'for all entries', but I would still like to know if it is possible to use results of a sub-query to check more than than 1 field or use another form of excluding join logic on multiple fields using SQL (without involving application server).

I have tested quite a few variations of sub-queries in the life-cycle of the program I was making. NOT EXISTS with multiple field check (shortened example below) to exclude based on 2 keys works in certain cases.
Performance acceptable (processing time is about 5 seconds), although, it's noticeably slower than the same query when excluding based on 1 field.
Select * from xxxx //xxxx is a result for a multiple table inner joins and 1 left join ( 1-* relation )
where NOT EXISTS (
select key from archive_table
where key = xxxx~key OR key = XXXX-foreign_key
)
EDIT:
With changing requirements (for more filtering) a lot has changed, so I figured I would update this. The construct I marked as XXXX in my example contained a single left join ( where main to secondary table relation is 1-* ) and it appeared relatively fast.
This is where context becomes helpful for understanding the problem:
Initial requirement: pull all vendors, without financial records in 3
tables.
Additional requirements: also exclude based on alternative
payers (1-* relationship). This is what example above is based on.
More requirements: also exclude based on alternative payee (*-* relationship between payer and payee).
Many-to-many join exponentially increased the record count within the construct I labeled XXXX, which in turn produces a lot of unnecessary work. For instance: a single customer with 3 payers, and 3 payees produced 9 rows, with a total of 27 fields to check (3 per row), when in reality there are only 7 unique values.
At this point, moving left-joined tables from main query into sub-queries and splitting them gave significantly better performance.
than any smarter looking alternatives.
select * from lfa1 inner join lfb1
where
( lfa1~lifnr not in ( select lifnr from bsik where bsik~lifnr = lfa1~lifnr )
and lfa1~lifnr not in ( select wyt3~lifnr from wyt3 inner join t024e on wyt3~ekorg = t024e~ekorg and wyt3~lifnr <> wyt3~lifn2
inner join bsik on bsik~lifnr = wyt3~lifn2 where wyt3~lifnr = lfa1~lifnr and t024e~bukrs = lfb1~bukrs )
and lfa1~lifnr not in ( select lfza~lifnr from lfza inner join bsik on bsik~lifnr = lfza~empfk where lfza~lifnr = lfa1~lifnr )
)
and [3 more sets of sub queries like the 3 above, just checking different tables].
My Conclusion:
When exclusion is based on a single field, both not in/not exits work. One might be better than the other, depending on filters you use.
When exclusion is based on 2 or more fields and you don't have many-to-many join in main query, not exists ( select .. from table where id = a.id or id = b.id or... ) appears to be the best.
The moment your exclusion criteria implements a many-to-many relationship within your main query, I would recommend looking for an optimal way to implement multiple sub-queries instead (even having a sub-query for each key-table combination will perform better than a many-to-many join with 1 good sub-query, that looks good).
Anyways, any additional insight into this is welcome.
EDIT2: Although it's slightly off topic, given how my question was about sub-queries, I figured I would post an update. After over a year I had to revisit the solution I worked on to expand it. I learned that proper excluding join works. I just failed horribly at implementing it the first time.
select header~key
from headers left join items on headers~key = items~key
where items~key is null

if it is possible to use results of a sub-query to check more than
than 1 field or use another form of excluding join logic on multiple
fields
No, it is not possible to check two columns in subquery, as SAP Help clearly says:
The clauses in the subquery subquery_clauses must constitute a scalar
subquery.
Scalar is keyword here, i.e. it should return exactly one column.
Your subquery can have multi-column key, and such syntax is completely legit:
SELECT planetype, seatsmax
FROM saplane AS plane
WHERE seatsmax < #wa-seatsmax AND
seatsmax >= ALL ( SELECT seatsocc
FROM sflight
WHERE carrid = #wa-carrid AND
connid = #wa-connid )
however you say that these two fields should be checked against different tables
Those 2 fields are also checked against two more tables
so it's not the case for you. Your only choice seems to be multi-join.
P.S. FOR ALL ENTRIES does not support negation logic, you cannot just use some sort of NOT IN FOR ALL ENTRIES, it won't be that easy.

Related

Cassandra Allow filtering

I have a table as below
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),start,id)
);
I want to run this query
Select * from test where day=1 and start > 1475485412 and start < 1485785654
and action='accept' ALLOW FILTERING
Is this ALLOW FILTERING efficient?
I am expecting that cassandra will filter in this order
1. By Partitioning column(day)
2. By the range column(start) on the 1's result
3. By action column on 2's result.
So the allow filtering will not be a bad choice on this query.
In case of the multiple filtering parameters on the where clause and the non indexed column is the last one, how will the filter work?
Please explain.
Is this ALLOW FILTERING efficient?
When you write "this" you mean in the context of your query and your model, however the efficiency of an ALLOW FILTERING query depends mostly on the data it has to filter. Unless you show some real data this is a hard to answer question.
I am expecting that cassandra will filter in this order...
Yeah, this is what will happen. However, the inclusion of an ALLOW FILTERING clause in the query usually means a poor table design, that is you're not following some guidelines on Cassandra modeling (specifically the "one query <--> one table").
As a solution, I could hint you to include the action field in the clustering key just before the start field, modifying your table definition:
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),action,start,id)
);
You then would rewrite your query without any ALLOW FILTERING clause:
SELECT * FROM test WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654
having only the minor issue that if one record "switches" action values you cannot perform an update on the single action field (because it's now part of the clustering key), so you need to perform a delete with the old action value and an insert it with the correct new value. But if you have Cassandra 3.0+ all this can be done with the help of the new Materialized View implementation. Have a look at the documentation for further information.
In general ALLOW FILTERING is not efficient.
But in the end it depends on the size of the data you are fetching (for which cassandra have to use ALLOW FILTERING) and the size of data its being fetched from.
In your case cassandra do not need filtering upto :
By the range column(start) on the 1's result
As you mentioned. But after that, it will rely on filtering to search data, which you are allowing in query itself.
Now, keep following in mind
If your table contains for example a 1 million rows and 95% of them have the requested value, the query will still be relatively efficient and you should use ALLOW FILTERING.
On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value, your query is extremely inefficient. Cassandra will load 999, 998 rows for nothing. If the query is often used, it is probably better to add an index on the time1 column.
So ensure this first. If it works in you favour, use FILTERING.
Otherwise, it would be wise to add secondary index on 'action'.
PS : There is some minor edit.

Where and Order By Clauses in Cassandra CQL

I am new to NoSQL database and have just started using apache Cassandra. I created a simple table "emp" with primary key on "empno" column. This is a simple table as we always get in Oracle's default scott schema.
Now I loaded data using the COPY command and issued query Select * from emp order by empno but I was surprised that CQL did not allow Order by on empno column (which is PK). Also when I used Where condition, it did not allow any inequality operations on empno column (it said only EQ or IN conditions are allowed). It also did not allowed Where and Order by on any other column, as they were not used in PK, and did not have an index.
Can someone please help me what should I do if I want to keep empno unique in the table and want a query results in Sorted order of empno?
(My version is:
cqlsh:demodb> show version
[cqlsh 5.0.1 | Cassandra 2.2.0 | CQL spec 3.3.0 | Native protocol v4]
)
There are two parts to a PRIMARY KEY in Cassandra:
partition key(s)
clustering key(s)
PRIMARY KEY (partitionKey1,clusteringKey1,clusteringKey2)
or
PRIMARY KEY ((partitionKey1,partitionKey2),clusteringKey1,clusteringKey2)
The partition key determines which node(s) your data is stored on. The clustering key determines the order of the data within your partition key.
In CQL, the ORDER BY clause is really only used to reverse the defined sort direction of your clustering order. As for the columns themselves, you can only specify the columns defined (and in that exact order...no skipping) in your CLUSTERING ORDER BY clause at table creation time. So you cannot pick arbitrary columns to order your result set at query-time.
Cassandra achieves performance by using the clustering keys to sort your data on-disk, thereby only returning ordered rows in a single read (no random reads). This is why you must take a query-based modeling approach (often duplicating your data into multiple query tables) with Cassandra. Know your queries ahead of time, and build your tables to serve them.
Select * from emp order by empno;
First of all, you need a WHERE clause. It's ok to query without it, if you're working with a relational database. With Cassandra, you should do your best to avoid unbound SELECT queries. Besides, Cassandra can only enforce a sort order within a partition, so querying without a WHERE clause won't return data in the order you want, anyway.
Secondly, as I mentioned above, you need to define clustering keys. If you want to order your result set by empno, then you must find another column to define as your partition key. Try something like this:
CREATE TABLE emp_by_dept (
empno text,
dept text,
name text,
PRIMARY KEY (dept,empno)
) WITH CLUSTERING ORDER BY (empno ASC);
Now, I can query employees by department, and they will be returned to me ordered by empno:
SELECT * FROM emp_by_dept WHERE dept='IT';
But to be clear, you will not be able to query every row in your table, and have it ordered by a single column. The only way to get meaningful order into your result sets, is first partition your data in a way that makes sense to your business case. Running an unbound SELECT will return all of your rows (assuming that the query doesn't time-out while trying to query every node in your cluster), but result set ordering can only be enforced within a partition. So you have to restrict by partition key in order for that to make any sense.
My apologies for self-promoting, but last year I wrote an article for DataStax called We Shall Have Order!, in which I addressed how to solve these types of problems. Give it a read and see if it helps.
Edit for additional questions:
From your answer I concluded 2 things about Cassandra:
(1) There is no
way of getting a result set which is only order by a column that has
been defined as Unique.
(2) When we define a PK
(partition-key+clustering-key), then the results will always be order
by Clustering columns within any fixed partition key (we must restrict
to one partition-key value), that means there is no need of ORDER BY
clause, since it cannot ever change the order of rows (the order in
which rows are actually stored), i.e. Order By is useless.
1) All PRIMARY KEYs in Cassandra are unique. There's no way to order your result set by your partition key. In my example, I order by empno (after partitioning by dept). – Aaron 1 hour ago
2) Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC.
I created an index on "empno" column of "emp" table, it is still not
allowing ORDER BY empno. So, what Indexes are for? are they only for
searching records for specific value of index key?
You cannot order a result set by an indexed column. Secondary indexes are (not the same as their relational counterparts) really only useful for edge-case, analytics-based queries. They don't scale, so the general recommendation is not to use secondary indexes.
Ok, that simply means that one table cannot be used for getting
different result sets with different conditions and different sorting
order.
Correct.
Hence for each new requirement, we need to create a new table.
IT means if we have a billion rows in a table (say Sales table), and
we need sum of sales (1) Product-wise, (2) Region-wise, then we will
duplicate all those billion rows in 2 tables with one in clustering
order of Product, the other in clustering order of Region,. and even
if we need to sum sales per Salesman_id, then we build a 3rd table,
again putting all those billion rows? is it sensible?
It's really up to you to decide how sensible it is. But lack of query flexibility is a drawback of Cassandra. To get around it you can keep creating query tables (I.E., trading disk for performance). But if it gets to a point where it becomes ungainly or difficult to manage, then it's time to think about whether or not Cassandra is really the right solution.
EDIT 20160321
Hi Aaron, you said above "Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC."
But i found even that is not correct. Cassandra only allows ORDER by in the same direction as we define in the "CLUSTERING ORDER BY" caluse of CREATE TABLE. If in that clause we define ASC, it allows only order by ASC, and vice versa.
Without seeing an error message, it's hard to know what to tell you on that one. Although I have heard of queries with ORDER BY failing when you have too many rows stored in a partition.
ORDER BY also functions a little odd if you specify multiple columns to sort by. If I have two clustering columns defined, I can use ORDER BY on the first column indiscriminately. But as soon as I add the second column to the ORDER BY clause, my query only works if I specify both sort directions the same (as the CLUSTERING ORDER BY definition) or both different. If I mix and match, I get this:
InvalidRequest: code=2200 [Invalid query] message="Unsupported order by relation"
I think that has to do with how the data is stored on-disk. Otherwise Cassandra would have more work to do in preparing result sets. Whereas if it requires everything to either to match or mirror the direction(s) specified in the CLUSTERING ORDER BY, it can just relay a sequential read from disk. So it's probably best to only use a single column in your ORDER BY clause, for more predictable results.
Adding a redux answer as the accepted one is quite long.
Order by is currently only supported on the clustered columns of the PRIMARY KEY
and when the partition key is restricted by an Equality or an IN operator in where clause.
That is if you have your primary key defined like this :
PRIMARY KEY ((a,b),c,d)
Then you will be able to use the ORDER BY when & only when your query has :
a where clause with all the primary key restricted either by an equality operator (=) or an IN operator such as :
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c,d;
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c;
These two query are the only valid ones.
Also this query would not work :
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY d,c;
because order by currently only support the ordering of columns following their declared order in the PRIMARY KEY that is in primary key definition c has been declared before d and the query violates the ordering by placing d first.

how to calculate unique stores selling products in powerpivot

I'm not overly experienced with DAX, and my boss has asked me for some metrics which seem past my capability. Specifically, she wants to know in what percentage of the stores any item is being sold at.
We have a Stores table which is in a one to many relationship with the measures table relating by [STORE_ID]. The items table is also related 1-to-M to the measures table by [ITEM_ID]. the measures table contains which indicates among other things, the weekly [Sold] of an item.
My current logic has been to separate this problem into two more simple parts, (stores selling product)/ (total stores). finding the total stores which is an easy distinctcount in the stores table, the The next is more difficult. I tried Stores_Selling_Product = countrows(filter(filter(Measures, earlier(measures[ITEM_ID]) = measures[ITEM_ID]), EARLIER(Measures[STORE_ID]) <> Measures[STORE_ID])), but I found that only excluded the stores that matched the current store ID. Is there a way to exclude stores that have already been counted?
if it helps there is also a binary cell [SoldInStore] which is 1 if the item is sold in that store else 0. I'm pretty certain I'm going to have to use something other than countrows, but I don't know where to look. Any advice would super nice. Thanks in advance,
-Mudkip.
AllStores:=
COUNTROWS( ALL( 'Stores' ) )
StoresWithSales:=
COUNTROWS(
CALCULATETABLE(
'Stores'
,'Measures'
)
)
%StoresSelling:=
DIVIDE( [StoresWithSales], [AllStores] )
The first measure will always return the total number of stores in the 'Stores' table, regardless of any filter context from the pivot table.
The second utilizes the magic of cross table filtering to filter 'Stores' to only those rows that have at least one corresponding row in 'Measures' based on whatever filter context currently exist on 'Measures'. Think of it as essentially the same as the following in SQL:
SELECT
COUNT(1)
FROM Stores
WHERE EXISTS (SELECT
COUNT(1)
FROM Measures
WHERE Measures.StoreKey = Stores.StoreKey
AND <whatever filter context exists on Measures>
);

get_range_slices and CQL query handling, need for ALLOW FILTERING

I have a following CQL table (a bit simplified for clarity):
CREATE TABLE test_table (
user uuid,
app_id ascii,
domain_id ascii,
props map<ascii,blob>,
PRIMARY KEY ((user), app_id, domain_id)
)
The idea is that this table would contain many users (i.e. rows, say, dozens of millions). For each user there would be a few domains of interest and there would be a few apps per domain. And for each user/domain/app there would be a small set of properties.
I need to scan this entire table and load its contents in chunks for given app_id and domain_id. My idea was to use the TOKEN function to be able to read the whole data set in several iterations. So, something like this:
SELECT props FROM test_table WHERE app_id='myapp1'
AND domain_id='mydomain1'
AND TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807;
I was assuming that this query would be efficient because I specify the range of the row keys and by specifying the values of the clustering keys I effectively specify the column range. But when I try to run this query I get the error message "Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING".
I have limited experience with Cassandra and I assumed that this sort of query would map into get_range_slices() call, which accepts the slice predicate (i.e. the range of columns defined by my app_id/domain_id values) and the key range defined by my token range. It seems either I misunderstand how this sort of query is handled or maybe I misunderstand about the efficiency of get_range_slices() call.
To be more specific, my questions are:
- if this data model does make sense for the kind of query I have in mind
- if this query is expected to be efficient
- if it is efficient, then why am I getting this error message asking me to ALLOW FILTERING
My only guess about the last one was that the rows that do not have the given combination of app_id/domain_id would need to be skipped from the result.
--- update ----
Thank for all the comments. I have been doing more research on this and there is still something that I do not fully understand.
In the given structure what I am trying to get is like a rectangular area from my data set (assuming that all rows have the same columns). Where top and the bottom of the rectangle is determined by the token range (range) and the left/right sides are defined as column range (slice). So, this should naturally transform into get_range_slices request. My understanding (correct me if I am wrong) that the reason why CQL requires me to put ALLOW FILTERING clause is because there will be rows that do not contain the columns I am looking for, so they will have to be skipped. And since nobody knows if it will have to skip every second row or first million rows before finding one that fits my criteria (in the given range) - this is what causes the unpredictable latency and possibly even timeout. Am I right? I have tried to write a test that does the same kind of query but using low-level Astyanax API (over the same table, I had to read the data generated with CQL, it turned out to be quite simple) and this test does work - except that it returns keys with no columns where the row does not contain the slice of columns I am asking for. Of course I had to implement some kind of simple paging based on the starting token and limit to fetch the data in small chunks.
Now I am wondering - again, considering that I would need to deal with dozens of millions of users: would it be better to partially "rotate" this table and organize it in something like this:
Row key: domain_id + app_id + partition no (something like hash(user) mod X)
Clustering key: column partition no (something like hash(user) >> 16 mod Y) + user
For the "column partition no"...I am not sure if it is really needed. I assume that if I go with this model I will have relatively small number of rows (X=1000..10000) for each domain + app combination. This will allow me to query the individual partitions, even in parallel if I want to. But (assuming the user is random UUID) for 100M users it will result in dozens or hundreds of thousands of columns per row. Is it a good idea to read one such a row in one request? It should created some memory pressure for Cassandra, I am sure. So maybe reading them in groups (say, Y=10..100) would be better?
I realize that what I am trying to do is not what Cassandra does well - reading "all" or large subset of CF data in chunks that can be pre-calculated (like token range or partition keys) for parallel fetching from different hosts. But I am trying to find a pattern that is the most efficient for such a use case.
By the way, the query like "select * from ... where TOKEN(user)>X and TOKEN(user)
Short answer
This warning means that Cassandra would have to read non-indexed data and filter out the rows that don't satisfy the criteria. If you add ALLOW FILTERING to the end of query, it will work, however it will scan a lot of data:
SELECT props FROM test_table
WHERE app_id='myapp1'
AND domain_id='mydomain1'
AND TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807
ALLOW FILTERING;
Longer explanation
In your example primary key consists of two parts: user is used as partition key, and <app_id, domain_id> form remaining part. Rows for different users are distributed across the cluster, each node responsible for specific range of token ring.
Rows on a single node are sorted by the hash of partition key (token(user) in your example). Different rows for single user are stored on a single node, sorted by <app_id, domain_id> tuple.
So, the primary key forms a tree-like structure. Partition key adds one level of hierarchy, and each remaining field of a primary key adds another one. By default, Cassandra processes only the queries that return all rows from the continuos range of the tree (or several ranges if you use key in (...) construct). If Cassandra should filter out some rows, ALLOW FILTERING must be specified.
Example queries that don't require ALLOW FILTERING:
SELECT * FROM test_table
WHERE user = 'user1';
//OK, returns all rows for a single partition key
SELECT * FROM test_table
WHERE TOKEN(user) > -9223372036854775808
AND TOKEN(user) < 9223372036854775807;
//OK, returns all rows for a continuos range of the token ring
SELECT * FROM test_table
WHERE user = 'user1'
AND app_id='myapp1';
//OK, the rows for specific user/app combination
//are stored together, sorted by domain_id field
SELECT * FROM test_table
WHERE user = 'user1'
AND app_id > 'abc' AND app_id < 'xyz';
//OK, since rows for a single user are sorted by app
Example queries that do require ALLOW FILTERING:
SELECT props FROM test_table
WHERE app_id='myapp1';
//Must scan all the cluster for rows,
//but return only those with specific app_id
SELECT props FROM test_table
WHERE user='user1'
AND domain_id='mydomain1';
//Must scan all rows having user='user1' (all app_ids),
//but return only those having specific domain
SELECT props FROM test_table
WHERE user='user1'
AND app_id > 'abc' AND app_id < 'xyz'
AND domain_id='mydomain1';
//Must scan the range of rows satisfying <user, app_id> condition,
//but return only those having specific domain
What to do?
In Cassandra it's not possible to create a secondary index on the part of the primary key. There are few options, each having its pros and cons:
Add a separate table that has primary key ((app_id), domain_id, user) and duplicate the necessary data in two tables. It will allow you to query necessary data for a specific app_id or <app_id, domain_id> combination. If you need to query specific domain and all apps - third table is necessary. This approach is called materialized views
Use some sort of parallel processing (hadoop, spark, etc) to perform necessary calculations for all app/domain combinations. Since Cassandra needs to read all the data anyway, there probably won't be much difference from a single pair. If the result for other pairs might be cached for later use, it will probably save some time.
Just use ALLOW FILTERING if query performance is acceptable for your needs. Dozens of millions partition keys is probably not too much for Cassandra.
Presuming you are using the Murmur3Partitioner (which is the right choice), you do not want to run range queries on the row key. This key is hashed to determine which node holds the row, and is therefore not stored in sorted order. Doing this kind of range query would therefore require a full scan.
If you want to do this query, you should store some known value as a sentinel for your row key, such that you can query for equality rather than range. From your data it appears that either app_id or domain_id would be a good choice, since it sounds like you always know these values when performing your query.

Joining two result sets into one

I wanted to know if there's a way to join two or more result sets into one.
I actually need to execute more than one query and return just one result set. I can't use the UNION or the JOIN operators because I'm working with Cassandra (CQL)
Thanks in advance !
Framework like Playorm provide support for JOIN (INNER and LEFT JOINs)queries in Cassandra.
http://buffalosw.com/wiki/Command-Line-Tool/
You may see more examples at:
https://github.com/deanhiller/playorm/blob/master/src/test/java/com/alvazan/test/TestJoins.java
If your wanting to query multiple rows within the same column family you can use the IN keyword:
SELECT * FROM testCF WHERE key IN ('rowKeyA', 'rowKeyB', 'rowKeyZ') LIMIT 10;
This will get you back 10 results from each row.
If your needing to join results from different CFs, or query with differing WHERE clauses, then you need to run multiple queries and merge the results in code - cassandra doesn't cater for that kind of thing.
PlayOrm can do joins, but you may need to have PlayOrm partitioning on so you still scale. (ie. you dont' want to join 1 billion rows with 1 billion rows). Typically instead you do a join of one partition with another partition or a partition on the Account table joining a partition on the Users table. ie. make sure you design for scale still.

Resources